Real-time streaming pipeline with Apache Flink 2.0, Kafka and Iceberg
Source: dev.to
It's 2:03 PM. A flash sale just started. In the warehouse, an operator is entering incoming orders into the management system. He types a quantity, makes a mistake, corrects it immediately. Two events, one reality. Thirty seconds apart.

The batch job that runs at 2 AM will see both. It won't know which one is right. Depending on how the reconciliation logic is written, if it exists at all, it picks one of the two, often non-deterministically. And if the correction falls into the next batch window, the problem doesn't surface right away: the morning's numbers are wrong, cleanly, with no technical error in sight. This is a real and recurring source of data quality problems in data teams.

Processing events as they arrive, in order, with their temporal context intact, fundamentally changes how this problem is handled. That's the starting point for this project: an end-to-end streaming pipeline on the Olist e-commerce dataset, built with Apache Flink 2.0, Kafka and Iceberg. The dataset and
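The core of the fix, ordering by event time rather than by batch arrival, can be sketched in a few lines. This is a plain-Python illustration, not code from the pipeline: the names `OrderEvent` and `latest_state` are invented here for the example. In Flink the same idea becomes keyed state plus event-time semantics, but the logic is identical: the event with the latest timestamp for a given key wins, deterministically.

```python
from dataclasses import dataclass

@dataclass
class OrderEvent:
    order_id: str
    quantity: int
    event_time: float  # when the operator made the entry, seconds since epoch

def latest_state(events):
    """Last-write-wins per order_id, decided by event time, not arrival order."""
    state = {}
    for e in sorted(events, key=lambda ev: ev.event_time):
        current = state.get(e.order_id)
        if current is None or e.event_time >= current.event_time:
            state[e.order_id] = e
    return state

# The 2:03 PM scenario: a typo, then its correction thirty seconds later.
events = [
    OrderEvent("order-42", quantity=100, event_time=1000.0),  # the mistake
    OrderEvent("order-42", quantity=10,  event_time=1030.0),  # the correction
]
print(latest_state(events)["order-42"].quantity)  # 10: the correction wins
```

The batch job's ambiguity disappears because the decision rule is attached to the event's own timestamp: even if the two events land in different batch windows, replaying them through `latest_state` yields the same answer every time.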