Reflections FINAL PDF
[Figure labels: MapReduce, Dryad, Pig, Hive, S4, System S, MillWheel, Stream]
Communication in a PE becomes function calls
Master Slave
Parallelism in a PE
Apache Beam
• A unified programming model for both batch and stream computing
applications.
Beam SDK
Dataflow Runner, Flink Runner, Streams Runner, Spark Runner, Apex Runner, Gearpump
ParDo
[Figure: ParDo between its input and its output]
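The ParDo step can be illustrated with a minimal plain-Python sketch. It assumes the usual reading of ParDo as an element-wise transform in which each input element yields zero or more output elements. The names `par_do` and `split_words` are hypothetical helpers for illustration and are not part of the Beam SDK.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def par_do(fn: Callable[[T], Iterable[U]], elements: Iterable[T]) -> Iterator[U]:
    """Apply fn to each element independently and flatten the results.

    Hypothetical helper: a stand-in for an element-wise transform,
    not the Beam SDK's ParDo.
    """
    for element in elements:
        # Each element may produce zero, one, or many outputs.
        yield from fn(element)


def split_words(line: str) -> List[str]:
    """Example per-element function: one line in, zero or more words out."""
    return line.split()


words = list(par_do(split_words, ["a b", "", "c"]))
print(words)
```

Because each element is handled independently, the per-element calls could run in parallel on different workers; the sketch runs them sequentially only for simplicity.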
GroupByKey
(k, v1), (k, v2) → (k, [v1, v2])
• Groups input values with the same key in the same window (pane) into the same output tuple
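The grouping rule can be sketched in plain Python as follows. This is an illustration only, assuming each tuple already carries a window identifier; `group_by_key` and the sample stream are hypothetical and do not use the Beam SDK.

```python
from collections import defaultdict


def group_by_key(tuples):
    """Group values that share both a window and a key.

    tuples: iterable of (window, key, value), where the window identifier
    is assumed to have been assigned by an earlier windowing step.
    Returns a dict mapping (window, key) to the list of values.
    """
    groups = defaultdict(list)
    for window, key, value in tuples:
        groups[(window, key)].append(value)
    return dict(groups)


# Hypothetical input: two windows (0 and 1) and two keys ("k" and "j").
stream = [(0, "k", "v1"), (0, "k", "v2"), (1, "k", "v3"), (0, "j", "v4")]
grouped = group_by_key(stream)
print(grouped)
```

Values `v1` and `v2` end up in one output tuple because they share key `k` and window 0, while `v3` has the same key but falls into a different window and is therefore grouped separately.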
Window Model Comparison
Discrete Windows
• Each window belongs to a fixed interval in event time
• After creation, a window never moves
• The mobility is achieved by creating, discarding, and merging windows
Continuous Windows
• Maintains a single window that moves along the time axis
• Mobility is achieved by receiving and evicting tuples
Pro? Con?
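The two window models can be contrasted with a small plain-Python sketch. It makes several assumptions: integer event times, tuples arriving in event-time order, and only the fixed-interval case of discrete windows (merging is not shown). `assign_fixed_window` and `SlidingWindow` are hypothetical names.

```python
from collections import deque


def assign_fixed_window(event_time, size):
    """Discrete model: map a tuple to the fixed interval [start, start + size).

    The window itself never moves; new tuples simply land in new windows.
    """
    start = (event_time // size) * size
    return (start, start + size)


class SlidingWindow:
    """Continuous model: one window that moves by receiving and evicting tuples.

    Assumes in-order arrival; the window covers (latest - length, latest].
    """

    def __init__(self, length):
        self.length = length
        self.tuples = deque()

    def receive(self, event_time, value):
        self.tuples.append((event_time, value))
        # Evict tuples that have fallen out of the window.
        while self.tuples and self.tuples[0][0] <= event_time - self.length:
            self.tuples.popleft()
        return [v for _, v in self.tuples]


print(assign_fixed_window(7, 5))

window = SlidingWindow(length=3)
for t, v in [(1, "a"), (2, "b"), (4, "c")]:
    contents = window.receive(t, v)
print(contents)
```

One trade-off visible in the sketch: the discrete assignment is a stateless function of the timestamp, whereas the continuous window must keep and scan state on every arrival.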
Lateness
• Time concepts:
1. Event Time: the time when the event occurs, recorded by the timestamp in the tuple
2. Processing Time: the time when the tuple gets processed at the operator in the pipeline
- Tuples that arrive with timestamps older than the watermark are considered late arrivals
[Figure: processing time plotted against event time, with the watermark]
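The lateness rule can be sketched in plain Python. The slides do not say how the watermark is computed, so this sketch assumes a simple heuristic: the watermark is the largest event time seen so far minus a fixed `allowed_delay`. Both that heuristic and the name `classify_arrivals` are assumptions made for illustration.

```python
def classify_arrivals(event_times, allowed_delay=0):
    """Flag late tuples, given event times listed in arrival order.

    Assumed heuristic watermark: max event time seen so far minus allowed_delay.
    Returns a list of (event_time, watermark_at_arrival, is_late).
    """
    max_seen = None
    results = []
    for t in event_times:
        watermark = None if max_seen is None else max_seen - allowed_delay
        # A tuple is late if its timestamp is older than the current watermark.
        is_late = watermark is not None and t < watermark
        results.append((t, watermark, is_late))
        if max_seen is None or t > max_seen:
            max_seen = t
    return results


arrivals = classify_arrivals([1, 2, 5, 3, 6], allowed_delay=1)
late = [t for t, _, is_late in arrivals if is_late]
print(late)
```

In this run the tuple with event time 3 arrives after the tuple with event time 5 has pushed the assumed watermark to 4, so it is the only late arrival.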
Join Example
Goal: jointly process these data
[Figure: the side input (1 to 6) passes through Window and View; the main input (a to f) passes through Window, GBK, and ParDo]
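The join pipeline can be sketched in plain Python, following the Window, View, GBK, and ParDo steps. This is an illustration only: it assumes fixed windows of a given size, integer event times, and keyed main-input tuples. `window_of`, `join_with_side_input`, and the sample data are hypothetical and do not use the Beam SDK.

```python
from collections import defaultdict


def window_of(event_time, size):
    """Assumed fixed windowing: window index for an integer event time."""
    return event_time // size


def join_with_side_input(main, side, size):
    """Join each main-input group with the side-input view of the same window.

    main: iterable of (event_time, key, value)
    side: iterable of (event_time, value)
    """
    # Window + View: materialize the side input per window.
    side_view = defaultdict(list)
    for t, value in side:
        side_view[window_of(t, size)].append(value)

    # Window + GBK: group main-input values by (window, key).
    groups = defaultdict(list)
    for t, key, value in main:
        groups[(window_of(t, size), key)].append(value)

    # ParDo: process each group together with the matching side view.
    return [
        (w, key, values, side_view.get(w, []))
        for (w, key), values in sorted(groups.items())
    ]


main_input = [(0, "x", "a"), (1, "x", "b"), (2, "y", "c"),
              (3, "x", "e"), (4, "x", "d"), (5, "y", "f")]
side_input = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
joined = join_with_side_input(main_input, side_input, size=3)
for row in joined:
    print(row)
```

The design choice shown here is that the side input is materialized once per window and then read by every main-input group in that window, rather than being grouped by key alongside the main input.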