
We want to use Apache Flink for the streaming job – read from one Kafka topic and write to another. The infrastructure will be deployed to Kubernetes. I want to restart the job whenever a PR is merged to the master branch.
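
For concreteness, a minimal sketch of such a topic-to-topic job with the DataStream API (the broker address, topic names, and group id below are placeholders I've assumed):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TopicToTopicJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Kafka offsets are stored as part of each checkpoint

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder
                .setTopics("input-topic")                   // placeholder
                .setGroupId("my-flink-job")                 // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")           // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .sinkTo(sink);
        env.execute("topic-to-topic");
    }
}
```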

Therefore, I wonder: does Flink guarantee that resubmitting the job will continue the data stream from the last processed message? This matters because one of the job's most important features is message deduplication within a time window.
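
For context, the deduplication is conceptually a keyed-state operator like the sketch below (the String key and the 10-minute window are assumptions); its state, like the Kafka offsets, lives in Flink's checkpoints and savepoints, which is why the restore behaviour matters:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits the first message per key and drops duplicates seen within the window.
public class DedupFunction extends KeyedProcessFunction<String, String, String> {
    private static final long WINDOW_MS = 10 * 60 * 1000; // hypothetical 10-minute window
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void processElement(String msg, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(msg); // first occurrence: forward it
            // clear the flag once the dedup window for this key has passed
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + WINDOW_MS);
        }
        // duplicates within the window are silently dropped
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        seen.clear();
    }
}
```

It would be applied as `stream.keyBy(msg -> msg).process(new DedupFunction())`, keyed by whatever identifies a duplicate.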

What are the patterns for updating streaming jobs in Apache Flink? Should I just stop the old job and submit the new one?

  • Have you checked the starting offset options for the Kafka consumer? If you set the starting offset to the latest committed offset, a Flink restart will cause the consumer to resume from the last processed Kafka offset (see the sketch after these comments). nightlies.apache.org/flink/flink-docs-release-1.16/docs/… Commented Jan 12, 2023 at 20:03
  • @Shankar thanks for sharing this. Still, I also want to understand when the Flink Kafka consumer commits these offsets. Suppose the consumer reads 1000 records and then I stop the job to resubmit its new version. Does the consumer commit the offsets before or after sending the processed messages to the Kafka sink? In other words, is there a chance that the previously fetched records are marked as committed and get lost on the next job submission? There are savepoints in Flink, so I could store these records in HDFS and process them afterwards. Am I right? Commented Jan 13, 2023 at 4:20
  • Savepoints are metadata, not the actual data from Kafka, unless you've explicitly told your Kafka consumer to write to HDFS; and even then the job would only resume based on the Kafka consumer group itself, rather than on the HDFS output. Commented Jan 13, 2023 at 15:36
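
Regarding the two comments above: by default the Kafka source commits offsets back to Kafka only when a checkpoint completes, and on a restore it resumes from the offsets stored in Flink's own checkpoint/savepoint state rather than from the committed group offsets; the committed offsets (or another OffsetsInitializer) are used only on a fresh start. A sketch of the builder call the first comment refers to, with placeholder broker/topic/group values:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class SourceConfig {
    static KafkaSource<String> kafkaSource() {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // placeholder
                .setTopics("input-topic")            // placeholder
                .setGroupId("my-flink-job")          // required for committed offsets
                // Resume from the group's committed offsets on a fresh start,
                // falling back to EARLIEST if the group has none yet.
                // (Offsets restored from a checkpoint/savepoint override this.)
                .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}
```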

2 Answers


My suggestion would be to simply try it.

Deploy your app manually and then stop it. Run the kafka-consumer-groups script to find your consumer group. Then restart/upgrade the app and run the command again for the same group. If the lag goes down (as it should) rather than resetting to the beginning of the topic, then it's working as expected, as it would for any Kafka consumer.
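
If you'd rather check that programmatically than with the CLI, a rough equivalent using the Kafka AdminClient (broker address and group id are assumed placeholders):

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the (hypothetical) consumer group of the Flink job.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-flink-job")
                         .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                                 .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag per partition = latest offset - committed offset.
            committed.forEach((tp, meta) ->
                    System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```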

"read from one Kafka topic and write to another."

Ideally, Kafka Streams is used for this.
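
If that route were taken, a minimal sketch of a topic-to-topic copy with Kafka Streams might look like this (application id, brokers, and topic names are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class CopyTopicApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-topic");         // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        // Read every record from the input topic and forward it to the output topic.
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```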




Kafka consumer offsets are saved as part of the checkpoint. So as long as your workflow is running in exactly-once mode, and your Kafka source is properly configured (e.g. you've set a group id), then restarting your job from the last checkpoint or savepoint will guarantee no duplicate records in the destination Kafka topic.
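
A short sketch of that setup (the checkpoint interval and transactional-id prefix are assumptions, not from the answer):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSetup {
    static StreamExecutionEnvironment createEnv() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every minute in exactly-once mode; the Kafka source offsets
        // are snapshotted as part of this state.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // For end-to-end exactly-once into the destination topic, the KafkaSink
        // from the question's pipeline additionally needs
        //     .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        //     .setTransactionalIdPrefix("my-flink-job")   // placeholder
        // and downstream consumers should read with isolation.level=read_committed.
        return env;
    }
}
```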

If you're stopping/restarting the job as part of your CI processing, then you'd want to:

  1. Stop the job with a savepoint.
  2. Restart the new version from that savepoint.

You could also set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true (so that a checkpoint is taken when the job is terminated), enable externalized checkpoints, and then restart from the last checkpoint, but the savepoint approach is easier.
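
A sketch of that alternative, assuming the Flink 1.15+ API names (earlier releases use CheckpointConfig.enableExternalizedCheckpoints instead):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    static StreamExecutionEnvironment createEnv() {
        Configuration conf = new Configuration();
        // Take a final checkpoint when the job's tasks finish (i.e. on termination).
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // Keep the latest checkpoint after cancellation so the next deployment can restore from it.
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        return env;
    }
}
```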

You'd also want some regular process that removes older savepoints, though the savepoints will be very small (only the consumer offsets, for your simple workflow).

