I have a Kafka Streams app that takes input from a Kafka Topic, aggregates it on three fields in the original's value in 5 minute windows. On the output side of this, I need to translate the aggregated message (where key = complex object, value = the aggregate) into a message where key = one of the fields in the complex object, and the value is the complex object + aggregate into a new field. Finally that output is persisted to an Iceberg table via the Iceberg Kafka Connector. The translation is required because of this part, Iceberg connector can only persist the value object not the key.
When the data hits the Iceberg connector, I'm seeing duplicate entries, e.g. 4 or 6 values for the same time window. I've ruled out late messages as the cause, as I've verified this inconsistent output exists in test environments so I'm certain it's a bug in my code. The duplication is happening before it hits iceberg.
I'm new when it comes to Kafka Streams. My understanding was that RocksDB is used as an intermediate store and there's several state stores managed within additional Kafka topics. The result being that the final message isn't written to the destination topic until the window is elapsed (and the grace period elapsed).
What possible reasons would cause duplicate messages getting outputted?