
We want to use Apache Flink for the streaming job – read from one Kafka topic and write to another. The infrastructure will be deployed to Kubernetes. I want to restart the job whenever a PR is merged to the master branch.
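
For concreteness, a minimal sketch of such a topic-to-topic job with the DataStream API (the broker address, topic names, and group id below are placeholders I've assumed):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TopicToTopicJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // Kafka offsets are stored as part of each checkpoint

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder
                .setTopics("input-topic")                   // placeholder
                .setGroupId("my-flink-job")                 // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")          // placeholder
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")           // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .sinkTo(sink);
        env.execute("topic-to-topic");
    }
}
```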

Therefore, I wonder: does Flink guarantee that resubmitting the job will continue the data stream from the last processed message? This matters because one of the job's most important features is message deduplication within a time window.
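
For context, the deduplication is conceptually a keyed-state operator like the sketch below (the String key and the 10-minute window are assumptions); its state, like the Kafka offsets, lives in Flink's checkpoints and savepoints, which is why the restore behaviour matters:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits the first message per key and drops duplicates seen within the window.
public class DedupFunction extends KeyedProcessFunction<String, String, String> {
    private static final long WINDOW_MS = 10 * 60 * 1000; // hypothetical 10-minute window
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void processElement(String msg, Context ctx, Collector<String> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(msg); // first occurrence: forward it
            // clear the flag once the dedup window for this key has passed
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + WINDOW_MS);
        }
        // duplicates within the window are silently dropped
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        seen.clear();
    }
}
```

It would be applied as `stream.keyBy(msg -> msg).process(new DedupFunction())`, keyed by whatever identifies a duplicate.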

What are the patterns for updating streaming jobs in Apache Flink? Should I just stop the old job and submit the new one?

  • Have you checked the starting offset options for the Kafka consumer? If you set the starting offset to the latest committed offset, a Flink restart will cause the consumer to resume from the last processed Kafka offset (see the sketch after these comments). nightlies.apache.org/flink/flink-docs-release-1.16/docs/… Commented Jan 12, 2023 at 20:03
  • @Shankar thanks for sharing this. Still, I also want to understand when the Flink Kafka consumer commits these offsets. Suppose the consumer reads 1000 records and then I stop the job to resubmit its new version. Does the consumer commit the offsets before or after sending the processed messages to the Kafka sink? In other words, is there a chance that the previously fetched records are marked as committed and get lost on the next job submission? There are savepoints in Flink, so I could store these records in HDFS and process them afterwards. Am I right? Commented Jan 13, 2023 at 4:20
  • Savepoints are metadata, not the actual data from Kafka, unless you've explicitly told your Kafka consumer to write to HDFS; and even then the job would only resume based on the Kafka consumer group itself, rather than on the HDFS output. Commented Jan 13, 2023 at 15:36
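
Regarding the two comments above: by default the Kafka source commits offsets back to Kafka only when a checkpoint completes, and on a restore it resumes from the offsets stored in Flink's own checkpoint/savepoint state rather than from the committed group offsets; the committed offsets (or another OffsetsInitializer) are used only on a fresh start. A sketch of the builder call the first comment refers to, with placeholder broker/topic/group values:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class SourceConfig {
    static KafkaSource<String> kafkaSource() {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // placeholder
                .setTopics("input-topic")            // placeholder
                .setGroupId("my-flink-job")          // required for committed offsets
                // Resume from the group's committed offsets on a fresh start,
                // falling back to EARLIEST if the group has none yet.
                // (Offsets restored from a checkpoint/savepoint override this.)
                .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}
```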

2 Answers


My suggestion would be to simply try it.

Deploy your app manually and then stop it. Run the kafka-consumer-groups script to find your consumer group. Then restart/upgrade the app and run the command again for the same group. If the lag goes down (as it should) rather than resetting to the beginning of the topic, then it's working as expected, as it would for any Kafka consumer.
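
If you'd rather check that programmatically than with the CLI, a rough equivalent using the Kafka AdminClient (broker address and group id are assumed placeholders):

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the (hypothetical) consumer group of the Flink job.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("my-flink-job")
                         .partitionsToOffsetAndMetadata().get();

            // Latest end offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                                 .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag per partition = latest offset - committed offset.
            committed.forEach((tp, meta) ->
                    System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```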

"read from one Kafka topic and write to another."

Ideally, Kafka Streams is used for this.
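
If that route were taken, a minimal sketch of a topic-to-topic copy with Kafka Streams might look like this (application id, brokers, and topic names are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class CopyTopicApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "copy-topic");         // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        // Read every record from the input topic and forward it to the output topic.
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
               .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```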




Kafka consumer offsets are saved as part of the checkpoint. So as long as your workflow is running in exactly-once mode, and your Kafka source is properly configured (e.g. you've set a group id), then restarting your job from the last checkpoint or savepoint will guarantee no duplicate records in the destination Kafka topic.
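
A short sketch of that setup (the checkpoint interval and transactional-id prefix are assumptions, not from the answer):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSetup {
    static StreamExecutionEnvironment createEnv() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoint every minute in exactly-once mode; the Kafka source offsets
        // are snapshotted as part of this state.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // For end-to-end exactly-once into the destination topic, the KafkaSink
        // from the question's pipeline additionally needs
        //     .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        //     .setTransactionalIdPrefix("my-flink-job")   // placeholder
        // and downstream consumers should read with isolation.level=read_committed.
        return env;
    }
}
```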

If you're stopping/restarting the job as part of your CI processing, then you'd want to:

  1. Stop the job with a savepoint.
  2. Restart the new version from that savepoint.

You could also set ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH to true (so that a checkpoint is taken when the job is terminated), enable externalized checkpoints, and then restart from the last checkpoint, but the savepoint approach is easier.
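
A sketch of that alternative, assuming the Flink 1.15+ API names (earlier releases use CheckpointConfig.enableExternalizedCheckpoints instead):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.ExecutionCheckpointingOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    static StreamExecutionEnvironment createEnv() {
        Configuration conf = new Configuration();
        // Take a final checkpoint when the job's tasks finish (i.e. on termination).
        conf.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // Keep the latest checkpoint after cancellation so the next deployment can restore from it.
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        return env;
    }
}
```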

You'd also want some regular process that removes older savepoints, though the savepoints will be very small (only the consumer offsets, for your simple workflow).

