Slide 7: Spark Introduction

Big Data (Spark)
Instructor: Trong-Hop Do
November 24th, 2020
S3Lab
Smart Software System Laboratory
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Introduction
● Apache Spark is an open-source cluster computing framework for real-time processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming.
Applications
(figure slide)
Components
● Spark Core and Resilient Distributed Datasets (RDDs)
● Spark SQL
● Spark Streaming
● Machine Learning Library (MLlib)
● GraphX
Spark Core
● The base engine for large-scale parallel and distributed data processing. Spark Core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core enable diverse workloads such as streaming, SQL, and machine learning. It is responsible for:
○ Memory management and fault recovery
○ Scheduling, distributing, and monitoring jobs on a cluster
○ Interacting with storage systems
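A minimal PySpark sketch of Spark Core's RDD API, a word count. It assumes a local Spark installation; the master URL, app name, and input file input.txt are illustrative placeholders, not part of the slides.

```python
# Minimal sketch of Spark Core's RDD API: a word count.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")        # local mode, for illustration
lines = sc.textFile("input.txt")                  # hypothetical input file
counts = (lines.flatMap(lambda l: l.split())      # split lines into words
               .map(lambda w: (w, 1))             # pair each word with a count
               .reduceByKey(lambda a, b: a + b))  # sum counts per word
print(counts.take(5))                             # action: triggers execution
sc.stop()
```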
Spark Streaming
● Used to process real-time streaming data.
● It enables high-throughput, fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is essentially a series of RDDs (Resilient Distributed Datasets) processed over time.
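A hedged sketch of a DStream pipeline. The socket source on localhost:9999 (e.g., fed with `nc -lk 9999`) and the 1-second batch interval are assumptions for illustration only.

```python
# Sketch: word counts over 1-second micro-batches from a socket stream.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)       # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical live source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```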
Spark Streaming - Workflow
(figure slide)
Spark Streaming - Fundamentals
● Streaming Context
● DStream (Discretized Stream)
● Caching
● Accumulators, Broadcast Variables, and Checkpoints
Spark SQL
● Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. The four libraries of Spark SQL are:
○ Data Source API
○ DataFrame API
○ Interpreter & Optimizer
○ SQL Service
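A hedged sketch of weaving SQL queries with code transformations, as described above. The file people.json and its columns (name, age) are illustrative assumptions.

```python
# Sketch: mixing a SQL query with DataFrame transformations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()
df = spark.read.json("people.json")       # hypothetical input file
df.createOrReplaceTempView("people")      # expose the data to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter(adults.age < 65).show()     # continue with DataFrame ops
spark.stop()
```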
Spark SQL - Data Source API
● Universal API for loading and storing structured data.
○ Built-in support for Hive, JSON, Avro, JDBC, Parquet, etc.
○ Supports third-party integration through Spark packages
○ Support for smart sources
○ Data abstraction and a Domain Specific Language (DSL) applicable to structured and semi-structured data
○ Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.)
○ Can be easily integrated with all Big Data tools and frameworks via Spark Core
○ Scales from kilobytes of data on a single-node cluster to petabytes on multi-node clusters
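A sketch of the uniform load/store interface the Data Source API provides. The file paths, the MySQL URL, and the credentials are hypothetical, and a real JDBC read would also require the driver jar on the classpath.

```python
# Sketch: one read/write interface across formats and storage systems.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourceDemo").getOrCreate()

# Read CSV, write Parquet: same API, different formats.
df = spark.read.format("csv").option("header", "true").load("events.csv")
df.write.format("parquet").mode("overwrite").save("events.parquet")

# JDBC source (hypothetical database and credentials).
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://localhost:3306/db")
           .option("dbtable", "users")
           .option("user", "reader").option("password", "secret")
           .load())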
Spark SQL - DataFrame API
● A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a relational table in SQL.
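A small sketch of building a DataFrame from an in-memory collection; the rows and column names are illustrative.

```python
# Sketch: a DataFrame has named columns, like a relational table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)],
                           schema=["name", "age"])
df.printSchema()                 # named, typed columns
df.filter(df.age > 40).show()    # relational-style filtering
```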
Spark SQL - SQL Interpreter and Optimizer
● The optimizer (Catalyst) is based on functional programming constructs in Scala.
○ It provides a general framework for transforming trees, used to perform analysis/evaluation, optimization, planning, and runtime code generation.
○ It supports cost-based optimization (runtime and resource utilization are termed the cost) and rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.
Spark SQL - SQL Service
● The SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
Spark SQL - Features
● Integration With Spark
● Uniform Data Access
● Hive Compatibility
● Standard Connectivity
● Performance And Scalability
● User Defined Functions (see the sketch below)
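A hedged sketch of the last feature, a user-defined function (UDF) in Spark SQL; the function, column names, and sample rows are illustrative.

```python
# Sketch: registering and applying a UDF on a DataFrame column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDFDemo").getOrCreate()
name_len = udf(lambda s: len(s), IntegerType())   # custom column function

df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
df.withColumn("name_length", name_len(df.name)).show()
```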
GraphX
● GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD with a Resilient Distributed Property Graph: a directed multigraph that can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it, and the parallel edges allow multiple relationships between the same vertices.
● GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API.
● In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
● GraphX unifies the ETL (Extract, Transform & Load) process, exploratory analysis, and iterative graph computation within a single system.
GraphX - Use Cases
● Disaster detection systems
● PageRank
● Financial fraud detection
● Business analysis
○ Machine learning, understanding customer purchase trends
● Geographic information systems
○ Watershed delineation and weather prediction
● Google Pregel
GraphX - Graph and Examples
(figure slide)

GraphX - Flight Data Analysis using Spark GraphX
(figure slides)
MLlib
● MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
MLlib - Algorithms
● Basic Statistics: summary statistics, correlation, stratified sampling, hypothesis testing, random data generation
● Regression
● Classification (see the sketch after this list)
● Recommendation Systems: collaborative, content-based
● Clustering
● Dimensionality Reduction: feature selection, feature extraction
● Feature Extraction
● Optimization
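A hedged sketch of one item from the list above, classification, using the DataFrame-based MLlib API (spark.ml); the toy training rows and the default column names (features, label) are illustrative.

```python
# Sketch: training and applying a logistic regression classifier.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()
train = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1]), 0.0),    # (feature vector, label)
    (Vectors.dense([2.0, 1.0]), 1.0),
    (Vectors.dense([2.2, 1.4]), 1.0),
], ["features", "label"])

model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("features", "prediction").show()
```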
MLlib - Use Cases
● Earthquake detection system (figure slide)
● Movie recommendation system (figure slides)
Challenges of Distributed Computing
● How to divide the input data
● How to assign the divided data to machines in the cluster
● How to check and monitor whether a machine in the cluster is alive and has the resources to perform its duty
● How to retry or reassign failed chunks to another machine or worker
● If the computation involves an aggregation operation such as a sum, how to collate results from many workers and compute the aggregate
● Efficient use of memory, CPU, and network
● Monitoring the tasks
● Overall job coordination
● Keeping a global time
Use Cases
● ETL
● Analytics
● Machine learning
● Graph processing
● SQL queries on large data sets
● Batch processing
● Stream processing
Features
● Multiple language support: Spark provides high-level APIs in Java, Scala, Python, and R, so applications can be written in any of these four languages. It also provides interactive shells for Scala and Python: the Scala shell is accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark in the installation directory.
● Fast data processing: up to 10 times faster on disk and 100 times faster in memory compared to Hadoop. Spark achieves this speed through controlled partitioning: it manages data using partitions that help parallelize distributed data processing with minimal network traffic.
● Through the RDD abstraction, Spark provides fault tolerance with guaranteed zero data loss.
● Increased system efficiency due to lazy evaluation of RDD transformations: Apache Spark delays evaluation until it is absolutely necessary, which is one of the key factors contributing to its speed. Spark adds transformations to a DAG (Directed Acyclic Graph) of computation, and only when the driver requests some data does this DAG actually get executed.
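A small sketch of lazy evaluation as just described: the transformations only build the DAG, and nothing runs until an action is called. The data and app name are illustrative.

```python
# Sketch: transformations are lazy; the action triggers the DAG.
from pyspark import SparkContext

sc = SparkContext("local[*]", "LazyDemo")
rdd = sc.parallelize(range(1_000_000))
squares = rdd.map(lambda x: x * x)             # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # still just building the DAG
print(evens.count())                           # action: the DAG executes now
sc.stop()
```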
● In-memory processing, resulting in high computation speed and an acyclic data flow
● With 80+ high-level operators, it is easy to develop dynamic and parallel applications
● Real-time data stream processing with Spark Streaming
● Flexible enough to run independently or integrated with the Hadoop YARN cluster manager. Apache Spark provides smooth compatibility with Hadoop, which is a boon for Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for Hadoop's MapReduce functions, and it can run on top of an existing Hadoop cluster using YARN for resource scheduling.
● Cost-efficient for Big Data, with minimal need for storage and data-center capacity
● Forward-looking analytics with built-in tools for machine learning (MLlib), interactive queries, and data streaming
● Persistent and immutable in nature, with data-parallel processing over the cluster
● GraphX simplifies graph analytics by collecting algorithms and builders
● Code reuse across batch processing and ad-hoc queries
● A progressive and expanding Apache community, active for quick assistance
● Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra, apart from the usual formats such as text files, CSV, and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
Spark Built on Hadoop
(figure slide)
RDD - Resilient Distributed Dataset
Definition
● The Resilient Distributed Dataset is the fundamental data structure abstraction of Spark.
● An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. For instance, you can create an RDD of integers; it gets partitioned, and the partitions are assigned to various nodes in the cluster for parallel processing.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
● Resilient - fault-tolerant: able to detect and recompute missing or damaged partitions of an RDD caused by node or network failures.
● Distributed - data is partitioned and resides on multiple nodes, depending on the cluster size, type, and configuration.
● In-memory data structure - RDDs are mostly kept in memory, so iterative operations run faster and perform far better than traditional Hadoop programs at executing iterative algorithms.
● Dataset - it can represent any form of data, whether read from a CSV file, loaded from an RDBMS table, or taken from a text, JSON, or XML file.
Workflow
● Create RDDs from either:
○ An existing collection in your driver program, or
○ A reference to a dataset in an external storage system.
● Two types of operations (see the sketch below):
○ Transformations: create a new RDD from an existing one.
○ Actions: applied to an RDD to instruct Spark to perform the computation and pass the result back to the driver.
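A hedged sketch of this workflow: both ways of creating RDDs, then one transformation and one action. The file data.txt is a hypothetical external dataset.

```python
# Sketch: creating RDDs, then transformations vs. actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDWorkflow")

# Create: from a driver-program collection, or from external storage.
nums = sc.parallelize([1, 2, 3, 4, 5])
lines = sc.textFile("data.txt")           # hypothetical external file

# Transformation: lazily builds a new RDD.
doubled = nums.map(lambda x: x * 2)

# Action: triggers computation, returns the result to the driver.
print(doubled.collect())                  # [2, 4, 6, 8, 10]
sc.stop()
```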
Partition and Parallelism
● A partition is a logical chunk of a large distributed data set. By default, Spark tries to read data into an RDD from the nodes that are close to it.
(figure slide)
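A small sketch of inspecting and controlling partitioning; the partition count of 8 and the local 4-core master are arbitrary choices for illustration.

```python
# Sketch: requesting a partition count and inspecting the result.
from pyspark import SparkContext

sc = SparkContext("local[4]", "PartitionDemo")
rdd = sc.parallelize(range(100), numSlices=8)   # request 8 partitions
print(rdd.getNumPartitions())                   # 8
print(rdd.glom().map(len).collect())            # element count per partition
sc.stop()
```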
Spark Program
Architecture
(figure slides)

Directed Acyclic Graph (DAG) Visualization
(figure slide)
Q&A
Thank you for your attention.
We hope to reach success together.