Batch vs Stream
Processing data at rest (SQL, Hadoop) vs processing data in motion (Flink, Kafka Streams). Do you need the answer now, or is tomorrow okay?
Batch Processing (The Tortoise)
Processing a bounded set of data. It has a start and an end.
- ETL: Extract, Transform, Load (Nightly jobs).
- MapReduce: Split huge files, process in parallel, combine results.
- Billing: Generate invoices at the end of the month.
High throughput, efficient compression, easy to replay/retry.
"Take all logs from yesterday. Calculate DAU."
Stream Processing (The Hare)
"Take this log right now. Update DAU counter."
Processing an unbounded stream of events. It never ends.
- Fraud Detection: Block card during the swipe.
- Monitoring: Alert if error rate spikes now.
- Recommendations: "You just viewed X, buy Y."
Low latency, real-time insights.
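The streaming counterpart of the batch DAU job above updates state one event at a time instead of scanning a finished file. A minimal sketch; the class and method names are illustrative.

```python
# Streaming-style DAU: incremental state updated the moment each event arrives.
class DAUCounter:
    def __init__(self):
        self.seen = {}  # day -> set of user_ids observed so far

    def on_event(self, timestamp, user_id):
        """Process a single event now; return the current DAU for its day."""
        day = timestamp[:10]
        self.seen.setdefault(day, set()).add(user_id)
        return len(self.seen[day])

counter = DAUCounter()
counter.on_event("2024-05-01T09:00:00", "alice")  # DAU = 1
counter.on_event("2024-05-01T10:30:00", "bob")    # DAU = 2
counter.on_event("2024-05-01T11:00:00", "alice")  # still 2 (duplicate user)
```

Same answer as the batch job, but available continuously instead of tomorrow morning.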
Lambda vs Kappa Architecture
Lambda (λ)
The Old Way
Two parallel pipelines: a batch layer for accurate historical views, plus a speed layer for low-latency updates, merged at query time. Costly: every change must be implemented twice.
Kappa (κ)
The Modern Way
"Batch" is just streaming through historical data quickly.
Uses tools like Apache Flink or Kafka Streams.
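The Kappa idea in miniature: one processing function, where "batch" is just replaying historical events through it quickly. A toy sketch, not Flink or Kafka Streams; the event shape and `process` function are assumptions.

```python
# Kappa in miniature: a single update step shared by replay and live paths.
def process(event, state):
    """Fold one event into running per-user totals."""
    user = event["user"]
    state[user] = state.get(user, 0) + event["amount"]
    return state

historical = [{"user": "alice", "amount": 5}, {"user": "bob", "amount": 3}]
live = [{"user": "alice", "amount": 2}]

state = {}
for event in historical:   # replay: the "batch" path
    state = process(event, state)
for event in live:         # the same code handles real-time events
    state = process(event, state)
print(state)  # {'alice': 7, 'bob': 3}
```

Because both paths run identical code, there is nothing to implement twice; that is the whole argument for Kappa over Lambda.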
Windowing Strategies
Since streams never end, how do you calculate an "average"? You must define a window of time.
Tumbling Window
Fixed size, non-overlapping.
Sliding Window
Fixed size, overlapping. "Last 5 minutes, updated every minute."
Session Window
Dynamic size based on activity. Ends after inactivity gap.
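A tumbling window can be sketched by hand: assign each event to a fixed, non-overlapping bucket by flooring its timestamp to the window boundary. Illustrative only, assuming (timestamp_seconds, value) pairs rather than a real stream.

```python
# Tumbling windows: fixed size, non-overlapping buckets of window_size seconds.
def tumbling_window_avg(events, window_size):
    """events: (timestamp_seconds, value) pairs -> {window_start: average}."""
    buckets = {}
    for ts, value in events:
        start = ts - (ts % window_size)  # floor to the window boundary
        buckets.setdefault(start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in buckets.items()}

events = [(0, 10), (30, 20), (65, 40), (110, 60)]
print(tumbling_window_avg(events, 60))  # {0: 15.0, 60: 50.0}
```

A sliding window would assign each event to several overlapping buckets instead of exactly one; a session window would close a bucket only after a gap of inactivity.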
Watermarks & Late Data
Event Time vs Processing Time
A mobile user clicks a button at 12:00:00 (Event Time). They go into a tunnel. The server receives the event at 12:05:00 (Processing Time).
The Watermark
A watermark is a heuristic that says "I have likely received all events with timestamps up to 12:00:00." Events that arrive after the watermark has passed their timestamp are late data: they are dropped, routed to a side output, or used to update the earlier result.
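One common watermark heuristic tracks the maximum event time seen and subtracts an allowed lateness, similar in spirit to Flink's bounded-out-of-orderness strategy. A minimal sketch; the class name and integer timestamps are assumptions for illustration.

```python
# Watermark heuristic: max event time seen so far, minus an allowed lateness.
class BoundedLatenessWatermark:
    def __init__(self, max_lateness):
        self.max_lateness = max_lateness
        self.max_event_time = float("-inf")

    def on_event(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)

    def current_watermark(self):
        """Claim: all events stamped <= this value have probably arrived."""
        return self.max_event_time - self.max_lateness

wm = BoundedLatenessWatermark(max_lateness=10)
for t in [100, 105, 103]:   # 103 arrives out of order, but within the bound
    wm.on_event(t)
print(wm.current_watermark())  # -> 95
is_late = 90 < wm.current_watermark()  # an event stamped 90 arriving now is late
```

The tunnel example above is exactly this case: the click stamped 12:00:00 arrives at 12:05:00, after the watermark has likely moved past 12:00:00, so the stream processor must treat it as late data.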