

Distributed Computing Systems

Shen Li @ IBM Research


Agenda
• Overview

• Stream Computing Systems

• Apache Beam (VLDB’15 Dataflow): A Unified Model for Batch and Stream Processing

• MapReduce (OSDI’04) presented by Huh & Cline

• Spark (NSDI’12) presented by Lin & Chang

• Spark Streaming (SOSP’13) presented by Murali & Zhang


Overview
• Motivation?

Handle Massive Data Lower Cost Reduce Complexity


Overview
• Applications?
Overview
• History

[Figure: timeline from 2004 to 2018; systems shown include MapReduce, Dryad, System S, Pig, Hive, S4, Stream, and MillWheel]

Trend? Why? See the problem?


Overview
• Categorization: based on granularity

[Figure: granularity spectrum from Batch through Micro-batch to Streaming; labeled systems include MapReduce and Dryad]


Overview

100 events per bundle/micro-batch
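
As a rough illustration of micro-batching, the sketch below cuts an event stream into bundles of at most 100 events, matching the bundle size on this slide. The function name `micro_batches` and its parameters are hypothetical and do not come from any particular engine.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(events: Iterable[T], bundle_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive bundles holding at most `bundle_size` events each."""
    if bundle_size <= 0:
        raise ValueError("bundle_size must be positive")
    it = iter(events)
    while True:
        # Pull the next bundle; the last one may be smaller than bundle_size.
        bundle = list(islice(it, bundle_size))
        if not bundle:
            return
        yield bundle


# 250 events are cut into two full bundles and one partial bundle.
sizes = [len(b) for b in micro_batches(range(250))]
print(sizes)  # [100, 100, 50]
```

A batch system would process all 250 events as one unit, while a streaming system would effectively use a bundle size of 1.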


Stream Computing Systems

• Fusion into PEs (processing elements)
• Execution: Master and Slave
• Communications within a PE become function calls
• Parallelism in a PE
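
The idea that communication inside a PE becomes function calls can be sketched as follows. This is a toy model under stated assumptions, not how any specific runtime implements fusion: `fuse`, `run_pe`, `parse`, and `double` are hypothetical names introduced for illustration.

```python
from functools import reduce
from typing import Callable, Iterable, List

Operator = Callable[[object], object]


def fuse(*operators: Operator) -> Operator:
    """Fuse operators into one callable; each hand-off is a plain function call."""
    def fused(tuple_: object) -> object:
        # No queue or network hop between operators: the output of one
        # operator is passed directly as the argument of the next.
        return reduce(lambda value, op: op(value), operators, tuple_)
    return fused


def run_pe(pe: Operator, stream: Iterable[object]) -> List[object]:
    """Run one fused PE over every tuple of the input stream."""
    return [pe(t) for t in stream]


def parse(text: str) -> int:
    return int(text)


def double(number: int) -> int:
    return number * 2


pe = fuse(parse, double)
print(run_pe(pe, ["1", "2", "3"]))  # [2, 4, 6]
```

Operators that are not fused would instead exchange tuples across process or machine boundaries, which is where serialization and transport costs come from.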
Apache Beam
• A unified programming model for both batch and stream computing
applications.

User App Written in Beam

Beam SDK

Runner (vendor)                Engine
Dataflow Runner                Google Dataflow
Flink Runner (DataArtisans)    Flink
Streams Runner                 IBM Streams
Spark Runner (DataBricks)      Spark
Apex Runner (DataTorrent)      Apex
Gearpump Runner (Intel)        Gearpump

• Why adopt Beam?

- Beam may become a language standard for streaming applications
- Applications no longer need to commit to a specific engine
Beam API
• Separate data processing logic from runtime requirements. To write an app, users need to answer four questions:

• What is being computed?

• Where in event time (when the event occurs) to create windows?

• When in processing time (when the tuple is processed) to carry out the computation?

• How do refinements relate?

[Figure: input flows through Window, Trigger, and Computation to output; refinements are handled by Discard/Accumulate]
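
The four questions can be made concrete with a small, engine-free sketch. It is not the Beam SDK: `windowed_sums` and all of its parameters are hypothetical. It assumes integer event times, fixed windows, and a trigger that fires after a fixed number of elements arrive in a window.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def windowed_sums(
    events: Iterable[Tuple[int, int]],
    window_size: int,
    trigger_every: int,
    accumulate: bool,
) -> List[Tuple[int, int]]:
    """Return (window_start, pane_value) pairs in the order panes fire.

    What:  a sum per window.
    Where: fixed event-time windows of length `window_size`.
    When:  a pane fires each time `trigger_every` new elements reach a window.
    How:   accumulate=True emits the running total, False emits only the
           sum of the elements received since the previous pane.
    `events` holds (event_time, value) pairs listed in arrival order.
    """
    pending = defaultdict(list)  # values received since the last firing
    totals = defaultdict(int)    # running total per window
    panes = []
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        pending[start].append(value)
        if len(pending[start]) == trigger_every:
            delta = sum(pending[start])
            totals[start] += delta
            panes.append((start, totals[start] if accumulate else delta))
            pending[start].clear()
    return panes


events = [(1, 10), (2, 20), (11, 5), (3, 30), (4, 40), (12, 7)]
print(windowed_sums(events, 10, 2, accumulate=True))   # [(0, 30), (0, 100), (10, 12)]
print(windowed_sums(events, 10, 2, accumulate=False))  # [(0, 30), (0, 70), (10, 12)]
```

The two calls differ only in the fourth question: the second pane of window 0 is 100 when accumulating and 70 when discarding.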


Primitive Transforms
• Source: Creates a stream

• Window: Defines windowing, triggering, and retracting schemes

• Flatten: Merges multiple input streams with the same tuple type into a single output stream

• View: Converts tuples/windows of the input stream into user-defined data structures, which can be consumed by ParDo as side inputs

• ParDo: Applies a user-defined DoFn to each tuple in the main input stream and emits one main output stream. It may also take multiple side input streams and generate multiple side output streams

• GroupByKey: Groups input values with the same key in the same window (pane) into the same output tuple, e.g. (k, v1), (k, v2) -> (k, [v1, v2])
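
To show how three of these primitives compose, here is a minimal pure-Python sketch. It imitates the semantics described above and is not the Beam SDK: `flatten`, `par_do`, `group_by_key`, and `double_value` are hypothetical stand-ins, tuples are modeled as (window, key, value), and side inputs and side outputs are left out.

```python
from collections import defaultdict
from itertools import chain
from typing import Callable, Iterable, Iterator, List, Tuple

Element = Tuple[str, str, int]  # (window, key, value)


def flatten(*streams: Iterable[Element]) -> Iterator[Element]:
    """Merge several streams of the same tuple type into one stream."""
    return chain(*streams)


def par_do(do_fn: Callable[[Element], Iterable[Element]],
           stream: Iterable[Element]) -> Iterator[Element]:
    """Apply do_fn to each tuple; do_fn may emit zero or more tuples."""
    for element in stream:
        yield from do_fn(element)


def group_by_key(stream: Iterable[Element]) -> List[Tuple[str, str, List[int]]]:
    """Group values that share both a window and a key."""
    groups = defaultdict(list)
    for window, key, value in stream:
        groups[(window, key)].append(value)
    return [(w, k, values) for (w, k), values in groups.items()]


def double_value(element: Element) -> Iterator[Element]:
    window, key, value = element
    yield (window, key, value * 2)


stream_a = [("w1", "k", 1)]
stream_b = [("w1", "k", 2), ("w2", "k", 3)]
print(group_by_key(par_do(double_value, flatten(stream_a, stream_b))))
# [('w1', 'k', [2, 4]), ('w2', 'k', [6])]
```

Values with the same key but different windows stay in separate output tuples, which is the per-window grouping the GroupByKey entry describes.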
Window Model Comparison
Discrete Windows
• Each window belongs to a fixed interval in event time
• After creation, a window never moves
• Mobility is achieved by creating, discarding, and merging windows

Continuous Windows
• Maintains a single window that moves along the time axis
• Mobility is achieved by receiving and evicting tuples

[Figure: windows 1, 2, 3 shown against event time and processing time]

Pro? Con?
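
The contrast between the two window models can be sketched in a few lines. This is an illustration under simplifying assumptions (integer timestamps, in-order arrival), and the names `discrete_window` and `ContinuousWindow` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


def discrete_window(event_time: int, size: int) -> Tuple[int, int]:
    """Return the fixed [start, end) interval that owns event_time.

    The interval never moves; later tuples simply land in newer windows.
    """
    start = (event_time // size) * size
    return (start, start + size)


class ContinuousWindow:
    """One window covering the last `size` time units.

    The window moves by receiving new tuples and evicting old ones.
    Assumes tuples arrive in event-time order.
    """

    def __init__(self, size: int) -> None:
        self.size = size
        self.tuples: Deque[Tuple[int, str]] = deque()

    def receive(self, event_time: int, value: str) -> List[str]:
        self.tuples.append((event_time, value))
        # Evict tuples that are `size` or more time units old.
        while self.tuples and self.tuples[0][0] <= event_time - self.size:
            self.tuples.popleft()
        return [v for _, v in self.tuples]


print(discrete_window(7, 5))   # (5, 10)
print(discrete_window(12, 5))  # (10, 15)

window = ContinuousWindow(size=5)
print(window.receive(1, "a"))  # ['a']
print(window.receive(4, "b"))  # ['a', 'b']
print(window.receive(7, "c"))  # ['b', 'c']
```

In the discrete model a late tuple can still be routed to its original interval, whereas the continuous model shown here has no place for a tuple older than the current window.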
Lateness
• Time concepts:
1. Event Time: the time when the event occurs, recorded by the timestamp in the tuple

2. Processing Time: the time when the tuple gets processed at an operator in the pipeline

3. Low Watermark: a local estimate of an operator's progress in event time.

• It is up to the app source operator and the runner to design the watermark algorithm. Usually, the watermark at an operator is the minimum watermark of all its upstream operators.
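
The "min of upstream watermarks" rule can be sketched as a pass over the operator graph. This is a simplification: real runners may also account for tuples buffered inside an operator. The function name `propagate_watermarks`, the operator names, and the watermark values are made up for illustration.

```python
from typing import Dict, List


def propagate_watermarks(
    source_watermarks: Dict[str, int],
    upstream: Dict[str, List[str]],
    order: List[str],
) -> Dict[str, int]:
    """Compute each operator's watermark as the min over its upstream operators.

    source_watermarks: watermark reported by each source operator.
    upstream:          for each non-source operator, the operators feeding it.
    order:             non-source operators in topological order.
    """
    watermarks = dict(source_watermarks)
    for operator in order:
        watermarks[operator] = min(watermarks[up] for up in upstream[operator])
    return watermarks


sources = {"src_a": 12, "src_b": 9}
upstream = {"join": ["src_a", "src_b"], "sink": ["join"]}
print(propagate_watermarks(sources, upstream, ["join", "sink"]))
# {'src_a': 12, 'src_b': 9, 'join': 9, 'sink': 9}
```

The slowest source (`src_b`) holds back the watermark of every operator downstream of it.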
Lateness
• What is late arrival?

- Tuples that arrive with timestamps older than the watermark are considered late arrivals

[Figure: tuples 1 to 7 plotted by Event Time (0 to 7) and Processing Time, with the Watermark line]
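
The late-arrival rule reduces to one comparison, sketched below. The names `is_late` and `split_arrivals` are hypothetical, and the sample timestamps and watermark values are invented rather than read off the figure.

```python
from typing import List, Tuple

Arrival = Tuple[str, int, int]  # (tuple id, event time, watermark on arrival)


def is_late(event_time: int, watermark: int) -> bool:
    """A tuple is late if its timestamp is older than the current watermark."""
    return event_time < watermark


def split_arrivals(arrivals: List[Arrival]) -> Tuple[List[str], List[str]]:
    """Separate tuple ids into (on_time, late) using the watermark at arrival."""
    on_time, late = [], []
    for tuple_id, event_time, watermark in arrivals:
        (late if is_late(event_time, watermark) else on_time).append(tuple_id)
    return on_time, late


arrivals = [("t1", 1, 0), ("t2", 2, 1), ("t3", 1, 3), ("t4", 4, 3)]
print(split_arrivals(arrivals))  # (['t1', 't2', 't4'], ['t3'])
```

Here `t3` carries event time 1 but arrives after the watermark has already advanced to 3, so it is the only late arrival.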
Join Example
Goal: jointly process these data

WindowFn needs to identify tuples in the target window

[Figure: the side input (tuples 1 to 6) passes through Window and View; the main input (tuples a to f) passes through Window, GBK, and ParDo; tuples are grouped into Window1, Window2, and Window3]
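
A window-based join of a main input with a side input can be sketched as follows. This is an engine-free illustration, not the Beam API: `window_of` and `join_by_window` are hypothetical names, fixed integer windows are assumed, and the timestamps are invented rather than taken from the figure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def window_of(event_time: int, size: int) -> int:
    """Index of the fixed window that contains event_time."""
    return event_time // size


def join_by_window(
    main: List[Tuple[int, str]],
    side: List[Tuple[int, int]],
    size: int,
) -> List[Tuple[int, str, List[int]]]:
    """Pair every main-input tuple with the side-input values of its window.

    main and side hold (event_time, value) pairs.
    """
    # Plays the role of View: one lookup structure per window.
    side_view: Dict[int, List[int]] = defaultdict(list)
    for event_time, value in side:
        side_view[window_of(event_time, size)].append(value)

    # Plays the role of ParDo with a side input: each main tuple reads
    # the view that belongs to its own window.
    joined = []
    for event_time, value in main:
        window = window_of(event_time, size)
        joined.append((window, value, list(side_view.get(window, []))))
    return joined


main_input = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
side_input = [(0, 1), (1, 2), (3, 4)]
print(join_by_window(main_input, side_input, size=2))
# [(0, 'a', [1, 2]), (0, 'b', [1, 2]), (1, 'c', [4]), (1, 'd', [4])]
```

Both inputs must be assigned windows by the same rule, which is why the windowing function has to identify the tuples that belong to the target window.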
