Batch vs Stream

Processing data at rest (SQL, Hadoop) vs processing data in motion (Flink, Kafka Streams). Do you need the answer now, or is tomorrow okay?

Batch Processing (The Tortoise)

Processing a bounded set of data. It has a start and an end.

  • ETL: Extract, Transform, Load (Nightly jobs).
  • MapReduce: Split huge files, process in parallel, combine results.
  • Billing: Generate invoices at the end of the month.
Pros:

High throughput, efficient compression, easy to replay/retry.

"Take all logs from yesterday. Calculate DAU."
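The "yesterday's logs" job can be sketched in a few lines. This is a minimal illustration, not a production ETL job: the `(user_id, timestamp)` record shape, the function name `daily_active_users`, and the sample data are all assumptions made for the example.

```python
from datetime import date, datetime

def daily_active_users(log_records, day):
    """Count distinct users seen on `day` in a bounded list of log records."""
    users = {
        user_id
        for user_id, ts in log_records
        if ts.date() == day
    }
    return len(users)

# Hypothetical sample: (user_id, event timestamp) pairs.
logs = [
    ("alice", datetime(2024, 1, 1, 9, 0)),
    ("bob", datetime(2024, 1, 1, 9, 5)),
    ("alice", datetime(2024, 1, 1, 17, 30)),
    ("carol", datetime(2024, 1, 2, 8, 0)),
]

print(daily_active_users(logs, date(2024, 1, 1)))  # 2
```

Because the input is bounded, the job can simply be rerun from scratch if it fails, which is where the "easy to replay/retry" property comes from.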

Stream Processing (The Hare)

"Take this log right now. Update DAU counter."

Processing an unbounded stream of events. It never ends.

  • Fraud Detection: Block card during the swipe.
  • Monitoring: Alert if error rate spikes now.
  • Recommendations: "You just viewed X, buy Y."
Pros:

Low latency, real-time insights.
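The streaming version of the DAU example updates the answer one event at a time instead of waiting for the day to end. A minimal sketch, assuming the same hypothetical `(user_id, timestamp)` events; the class name `StreamingDau` is made up for illustration, and it keeps an exact in-memory set per day, which a real system might replace with an approximate structure.

```python
from collections import defaultdict
from datetime import datetime

class StreamingDau:
    """Keeps a running distinct-user count per day, updated per event."""

    def __init__(self):
        # day -> set of user ids seen so far on that day
        self._users_by_day = defaultdict(set)

    def on_event(self, user_id, ts):
        """Handle one event and return the current DAU for that event's day."""
        day = ts.date()
        self._users_by_day[day].add(user_id)
        return len(self._users_by_day[day])

counter = StreamingDau()
print(counter.on_event("alice", datetime(2024, 1, 1, 9, 0)))    # 1
print(counter.on_event("bob", datetime(2024, 1, 1, 9, 5)))      # 2
print(counter.on_event("alice", datetime(2024, 1, 1, 17, 30)))  # 2
```

The result is available after every event, at the cost of keeping state alive for as long as the stream runs.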

Lambda vs Kappa Architecture

Lambda (λ)

The Old Way

Batch Layer: Master dataset (Hadoop). Accurate but slow.
Speed Layer: Recent data (Storm). Fast but approximate.
Serving Layer: Merges both views.
Con: Maintaining two codebases (e.g., Java for batch, Scala for stream) is painful.

Kappa (κ)

The Modern Way

Stream Only: Everything is a stream.

"Batch" is just streaming through historical data quickly.

Uses tools like Apache Flink or Kafka Streams.

Pro: One codebase to rule them all.
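The "one codebase" idea can be sketched without any framework: the same function consumes a replay of stored events and then continues on live ones. The function `count_events`, the event shape, and the sample data are assumptions for illustration, not the API of Flink or Kafka Streams.

```python
from collections import Counter

def count_events(events, counts=None):
    """Single processing logic: consume any iterable of (user_id, action) events."""
    counts = Counter() if counts is None else counts
    for user_id, _action in events:
        counts[user_id] += 1
    return counts

# Hypothetical stored log, replayed from the beginning ("batch").
historical_log = [("alice", "view"), ("bob", "view"), ("alice", "buy")]

def live_events():
    """Hypothetical live source; a real one would never stop yielding."""
    yield ("bob", "buy")
    yield ("carol", "view")

counts = count_events(historical_log)          # replay history quickly
counts = count_events(live_events(), counts)   # keep going on new events
print(dict(counts))  # {'alice': 2, 'bob': 2, 'carol': 1}
```

Reprocessing after a logic change then means replaying the log through the new version of the same function, rather than fixing a second batch implementation.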

Windowing Strategies

Since streams never end, how do you calculate an "average"? You must define a window of time.

Tumbling Window

Fixed size, non-overlapping.

[00:00 - 00:05]
[00:05 - 00:10]
[00:10 - 00:15]
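Assigning an event to a tumbling window is just rounding its timestamp down to a multiple of the window size. A minimal sketch, assuming timestamps in seconds and half-open windows `[start, start + size)`; the function names and sample events are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute windows, as in the example above

def tumbling_window_start(event_ts, size=WINDOW_SECONDS):
    """Return the start of the single window that contains event_ts."""
    return event_ts - (event_ts % size)

def tumbling_averages(events, size=WINDOW_SECONDS):
    """Average the values of (timestamp, value) events per tumbling window."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [total, count]
    for ts, value in events:
        bucket = sums[tumbling_window_start(ts, size)]
        bucket[0] += value
        bucket[1] += 1
    return {start: total / n for start, (total, n) in sorted(sums.items())}

events = [(10, 2.0), (290, 4.0), (300, 10.0), (610, 6.0)]
print(tumbling_averages(events))  # {0: 3.0, 300: 10.0, 600: 6.0}
```

Each event lands in exactly one window, so nothing is counted twice.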

Sliding Window

Fixed size, overlapping. "Last 5 minutes, updated every minute".

Window 1: [00:00 - 00:05]
Window 2: [00:01 - 00:06]
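With overlapping windows, one event belongs to several windows at once. The sketch below lists the start of every window containing a given timestamp, assuming seconds, half-open windows `[start, start + size)`, window starts aligned to multiples of the slide, and a stream that begins at time 0. The function name is hypothetical.

```python
def sliding_window_starts(event_ts, size=300, slide=60):
    """Return the starts of all sliding windows that contain event_ts."""
    # The most recent window start at or before the event.
    start = event_ts - (event_ts % slide)
    starts = []
    # Walk backwards while the window [start, start + size) still covers the event.
    while start > event_ts - size and start >= 0:
        starts.append(start)
        start -= slide
    return sorted(starts)

print(sliding_window_starts(400))  # [120, 180, 240, 300, 360]
print(sliding_window_starts(130))  # [0, 60, 120]
```

A 5-minute window sliding every minute means each event is processed in up to `size / slide = 5` windows, which is the price of the smoother, more frequently updated result.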

Session Window

Dynamic size based on activity. Ends after inactivity gap.

[User Active...]
Gap
[Active]
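Session windows are defined by the data rather than the clock. A minimal sketch for one user's event timestamps, assuming seconds and a 30-minute inactivity gap (both hypothetical); it sorts the input first for simplicity, whereas a stream processor would have to merge sessions incrementally as events arrive.

```python
def session_windows(timestamps, gap=30 * 60):
    """Group timestamps into sessions that end after `gap` seconds of inactivity."""
    sessions = []  # each session is [first_ts, last_ts]
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1][1] = ts        # still active: extend the current session
        else:
            sessions.append([ts, ts])   # gap exceeded: start a new session
    return [tuple(s) for s in sessions]

clicks = [0, 600, 1500, 5400, 5700]
print(session_windows(clicks))  # [(0, 1500), (5400, 5700)]
```

Treating an event exactly `gap` seconds after the previous one as part of the same session is a choice made here; the boundary could equally be exclusive.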

Watermarks & Late Data

Event Time vs Processing Time

A mobile user clicks a button at 12:00:00 (Event Time). They go into a tunnel. The server receives the event at 12:05:00 (Processing Time).

The Watermark

A watermark is a heuristic, expressed in event time, that says "I have likely received all events that happened up to 12:00:00".

If an event arrives BEFORE the watermark passes its event time: Process normally.
If an event arrives AFTER the watermark has passed its event time: It is "Late". Drop it? Update the old result? (Configurable).
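One common way to produce a watermark is to trail the largest event time seen so far by a fixed allowance. The sketch below uses that heuristic; the class name `WatermarkTracker`, the 60-second allowance, and the timestamps (in seconds) are assumptions for illustration, and what to do with a late event is left to the caller.

```python
class WatermarkTracker:
    """Watermark = max event time seen so far minus an allowed delay."""

    def __init__(self, max_out_of_orderness):
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = None

    @property
    def watermark(self):
        if self.max_event_time is None:
            return float("-inf")  # nothing seen yet, so nothing can be late
        return self.max_event_time - self.max_out_of_orderness

    def observe(self, event_time):
        """Classify the event against the current watermark, then advance it."""
        status = "late" if event_time < self.watermark else "on_time"
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return status

tracker = WatermarkTracker(max_out_of_orderness=60)
print(tracker.observe(1000))  # on_time (watermark becomes 940)
print(tracker.observe(950))   # on_time (out of order, but within the allowance)
print(tracker.observe(1200))  # on_time (watermark becomes 1140)
print(tracker.observe(1000))  # late    (event time is behind the watermark)
```

A larger allowance catches more stragglers, such as the user in the tunnel, but delays every result by the same amount.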