Slide 7 Spark Introduction

Big Data (Spark)
Lab Instructor: Trong-Hop Do
November 24th 2020
S3Lab - Smart Software System Laboratory
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Introduction
● Apache Spark is an open-source cluster computing framework for real-time processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
Introduction
(two figure slides)
Applications
(figure slide)
Components
● Spark Core and Resilient Distributed Datasets (RDDs)
● Spark SQL
● Spark Streaming
● Machine Learning Library (MLlib)
● GraphX
Components
(figure slide)
Components
Spark Core
● The base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:
○ Memory management and fault recovery
○ Scheduling, distributing and monitoring jobs on a cluster
○ Interacting with storage systems
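Those scheduling and fault-recovery duties can be pictured with a toy scheduler in plain Python (a conceptual sketch, not the Spark API): one task is submitted per data partition, results are monitored, and failed tasks are resubmitted. `run_job`, the worker count and the retry limit are all illustrative choices, not Spark names.

```python
# Toy "core engine" sketch: schedule one task per partition, monitor
# results, and retry failed tasks (plain Python, not the Spark API).
from concurrent.futures import ThreadPoolExecutor

def run_job(partitions, task, retries=2):
    """Run `task` over every partition, resubmitting failures."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        pending = {i: pool.submit(task, p) for i, p in enumerate(partitions)}
        for _ in range(retries + 1):
            failed = {}
            for i, fut in pending.items():
                try:
                    results[i] = fut.result()
                except Exception:
                    failed[i] = pool.submit(task, partitions[i])  # reschedule
            pending = failed
            if not pending:
                break
    return [results[i] for i in sorted(results)]

partitions = [[1, 2], [3, 4], [5, 6]]
print(run_job(partitions, sum))  # [3, 7, 11] -- one result per partition
```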
Components
Spark Streaming
● Used to process real-time streaming data.
● It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.
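The DStream idea can be sketched in plain Python (not the Spark API): the stream is modeled as a sequence of micro-batches, and a transformation is applied batch by batch, just as each batch would be processed as a small RDD. `dstream_map` is a hypothetical helper name.

```python
# DStream sketch: a stream as a series of micro-batches, each processed
# like a small RDD (plain Python, not the Spark Streaming API).
def dstream_map(batches, fn):
    """Apply `fn` to every record of every micro-batch, batch by batch."""
    return [[fn(record) for record in batch] for batch in batches]

# Three micro-batches arriving over time (e.g., one per batch interval).
stream = [[1, 2], [3], [4, 5]]
doubled = dstream_map(stream, lambda x: x * 2)
print(doubled)  # [[2, 4], [6], [8, 10]]
```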
Components
Spark Streaming - Workflow
(figure slide)
Components
Spark Streaming - Fundamentals
● Streaming Context
● DStream (Discretized Stream)
● Caching
● Accumulators, Broadcast Variables and Checkpoints
Components
Spark Streaming - Fundamentals
(four figure slides)
Components
Spark SQL
● Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. The following are the four libraries of Spark SQL:
○ Data Source API
○ DataFrame API
○ Interpreter & Optimizer
○ SQL Service
Components
Spark SQL
(figure slide)
Components
Spark SQL - Data Source API
● Universal API for loading and storing structured data.
○ Built-in support for Hive, JSON, Avro, JDBC, Parquet, etc.
○ Supports third-party integration through Spark packages
○ Support for smart sources
○ Data abstraction and a Domain Specific Language (DSL) applicable to structured and semi-structured data
○ Supports different data formats (Avro, CSV, Elasticsearch and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.)
○ Can be easily integrated with all Big Data tools and frameworks via Spark Core
○ Processes data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to multi-node clusters
Components
Spark SQL - DataFrame API
● A DataFrame is a distributed collection of data organized into named columns. It is equivalent to a relational table in SQL, used for storing data in tables.
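The named-column idea can be sketched in plain Python (a conceptual stand-in, not the DataFrame API): rows are dicts keyed by column name, and `select`/`where` are hypothetical helpers mimicking column projection and row filtering.

```python
# "DataFrame" sketch: rows with named columns, plus select/filter
# helpers (plain Python, not the Spark DataFrame API).
rows = [
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 25},
    {"name": "Cid", "age": 40},
]

def select(df, *cols):
    """Project the given named columns from every row."""
    return [{c: r[c] for c in cols} for r in df]

def where(df, pred):
    """Keep only the rows satisfying the predicate."""
    return [r for r in df if pred(r)]

over_30 = select(where(rows, lambda r: r["age"] > 30), "name")
print(over_30)  # [{'name': 'Ann'}, {'name': 'Cid'}]
```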
Components
Spark SQL - SQL Interpreter and Optimizer
● Based on functional programming constructs in Scala.
○ Provides a general framework for transforming trees, which is used to perform analysis/evaluation, optimization, planning, and runtime code generation.
○ Supports cost-based optimization (run time and resource utilization are termed cost) and rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.
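Rule-based tree transformation can be illustrated with one tiny rule, constant folding, in plain Python (a conceptual sketch, not Catalyst): expressions are tuples, and `fold` rewrites any addition of two constants before the plan would ever be executed.

```python
# Rule-based optimization sketch: rewrite an expression tree by folding
# constant sub-expressions (plain Python, not Spark's Catalyst optimizer).
def fold(expr):
    """Recursively replace ('+', a, b) with a constant when both sides are."""
    if isinstance(expr, tuple):
        op, a, b = expr
        a, b = fold(a), fold(b)
        if op == "+" and isinstance(a, int) and isinstance(b, int):
            return a + b
        return (op, a, b)
    return expr

plan = ("+", ("+", 1, 2), "col_x")  # "col_x" stands for a column reference
print(fold(plan))  # ('+', 3, 'col_x')
```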
Components
Spark SQL - SQL Service
● SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
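The idea of an entry point that holds structured data and executes SQL over it can be sketched with Python's built-in sqlite3 module (a stand-in for illustration, not Spark's SQL Service):

```python
# Entry-point sketch: register structured data, then run SQL over it
# (Python's stdlib sqlite3 as a stand-in, not Spark's SQL Service).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 25), ("Cid", 40)])
result = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name").fetchall()
print(result)  # [('Ann',), ('Cid',)]
```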
Components
Spark SQL - Features
● Integration With Spark
● Uniform Data Access
● Hive Compatibility
● Standard Connectivity
● Performance And Scalability
● User-Defined Functions
Components
GraphX
● GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph. The property graph is a directed multigraph which can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. Here, the parallel edges allow multiple relationships between the same vertices.
Components
GraphX
● GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API.
● In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
● GraphX unifies the ETL (Extract, Transform & Load) process, exploratory analysis and iterative graph computation within a single system.
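A property graph can be sketched in plain Python (not GraphX): vertices and edges carry user-defined property maps, parallel edges between the same pair of vertices are allowed, and `subgraph` is a hypothetical helper mimicking the operator of the same name.

```python
# Property-graph sketch: a directed multigraph with properties on both
# vertices and edges (plain Python, not the GraphX API).
vertices = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cid"}}
edges = [  # (src, dst, properties); parallel edges 1->2 are allowed
    (1, 2, {"rel": "follows"}),
    (1, 2, {"rel": "emails"}),
    (2, 3, {"rel": "follows"}),
]

def subgraph(edges, pred):
    """Keep only the edges whose property map satisfies `pred`."""
    return [e for e in edges if pred(e[2])]

follows = subgraph(edges, lambda p: p["rel"] == "follows")
print(follows)  # [(1, 2, {'rel': 'follows'}), (2, 3, {'rel': 'follows'})]
```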
Components
GraphX - Use cases
● Disaster Detection System
● Page Rank
● Financial Fraud Detection
● Business Analysis
○ Machine learning, understanding customer purchase trends
● Geographic Information Systems
○ Watershed delineation and weather prediction
● Google Pregel
Components
GraphX - Graph and Examples
(figure slide)
GraphX - Flight Data Analysis using Spark GraphX
(three figure slides)
Components
MLlib
● MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
Components
MLlib - Algorithms
● Basic Statistics: Summary, Correlation, Stratified Sampling, Hypothesis Testing, Random Data Generation
● Regression
● Classification
● Recommendation System: Collaborative, Content-Based
● Clustering
● Dimensionality Reduction: Feature Selection, Feature Extraction
● Feature Extraction
● Optimization
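Two of the Basic Statistics primitives from the list above, column summary (mean) and Pearson correlation, sketched in plain Python rather than MLlib:

```python
# Basic-statistics sketch: mean and Pearson correlation
# (plain Python, not Spark MLlib).
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]  # ys is exactly 2 * xs
print(pearson(xs, ys))  # 1.0 -- perfectly correlated
```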
Components
MLlib - Use cases
● Earthquake Detection System (figure slide)
● Movie Recommendation System (two figure slides)
Challenges of Distributed Computing
● How to divide the input data
● How to assign the divided data to machines in the cluster
● How to check and monitor whether a machine in the cluster is live and has resources to perform its duty
● How to retry or reassign failed chunks to another machine or worker
Challenges of Distributed Computing
● If the computation involves any aggregation operation like a sum, how to collate results from many workers and compute the aggregation
● Efficient use of memory, CPU and network
● Monitoring the tasks
● Overall job coordination
● Keeping a global time
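The collation challenge in miniature (plain Python, not Spark): each worker computes a partial sum over its own chunk, and a driver combines the partials into the final aggregate.

```python
# Collation sketch: worker-side partial sums combined by the driver
# (plain Python, not Spark).
from functools import reduce

chunks = [[1, 2, 3], [4, 5], [6]]             # input divided across workers
partials = [sum(chunk) for chunk in chunks]   # computed worker-side
total = reduce(lambda a, b: a + b, partials)  # collated by the driver
print(partials, total)  # [6, 9, 6] 21
```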
Use cases
● ETL
● Analytics
● Machine Learning
● Graph processing
● SQL queries on large data sets
● Batch processing
● Stream processing
Features
● Support for multiple languages, namely Java, R, Scala and Python, for building applications. Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark from the installed directory.
Features
● Fast data processing: about 10 times faster on disk and 100 times faster in memory compared to Hadoop. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.
● Through its RDD abstraction, Spark provides fault tolerance with ensured zero data loss.
Features
● Increased system efficiency due to lazy evaluation of RDD transformations: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. For transformations, Spark adds them to a DAG (Directed Acyclic Graph) of computation, and only when the driver requests some data does this DAG actually get executed.
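Lazy evaluation can be sketched with a toy class (plain Python, not the Spark API): `LazySeq`, `map` and `collect` are illustrative stand-ins for an RDD, a transformation and an action. Transformations only extend a recorded plan; nothing runs until an action is called.

```python
# Lazy-evaluation sketch: transformations record a plan, actions execute
# it (plain Python, not the Spark API).
class LazySeq:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    def map(self, fn):            # transformation: just extends the plan
        return LazySeq(self.data, self.plan + (fn,))

    def collect(self):            # action: executes the whole plan
        out = list(self.data)
        for fn in self.plan:
            out = [fn(x) for x in out]
        return out

nums = LazySeq([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
# Nothing has been computed yet; `collect` triggers execution.
print(nums.collect())  # [20, 30, 40]
```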
Features
(figure slide)
Features
● In-memory processing, resulting in high computation speed and acyclic data flow
● With 80+ high-level operators it is easy to develop dynamic and parallel applications
● Real-time data stream processing with Spark Streaming
Features
● Flexible enough to run independently, and can be integrated with the Hadoop YARN cluster manager. Apache Spark provides smooth compatibility with Hadoop. This is a boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.
Features
● Cost-efficient for Big Data, with minimal need for storage and data centers
● Futuristic analysis with built-in tools for machine learning (MLlib), interactive queries and data streaming
● Persistent and immutable in nature, with data processed in parallel over the cluster
● GraphX simplifies graph analytics by collecting algorithms and builders
● Code re-use for batch processing and for running ad-hoc queries
● A progressive and expanding Apache community, active for quick assistance
Features
● Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra, apart from the usual formats such as text files, CSV and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
Spark built on Hadoop
(figure slide)
RDD - Resilient Distributed Dataset
Definition
● The Resilient Distributed Dataset is the fundamental data structure abstraction of Spark.
● An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. For instance, you can create an RDD of integers, and these get partitioned, divided and assigned to various nodes in the cluster for parallel processing.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
RDD - Resilient Distributed Dataset
Definition
● Resilient - RDDs are fault-tolerant, able to detect and recompute missing or damaged partitions of an RDD caused by node or network failures.
● Distributed - data is partitioned and resides on multiple nodes depending on the cluster size, type and configuration.
● In-memory data structure - RDDs mostly live in memory, so iterative operations run faster and perform far better than traditional Hadoop programs when executing iterative algorithms.
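Lineage-based recovery in miniature (plain Python, not Spark): a lost partition is recomputed from its source data plus the recorded transformation, rather than restored from a replica. The names here are illustrative.

```python
# "Resilient" sketch: recompute a lost partition from lineage
# (plain Python, not the Spark API).
source = [[1, 2], [3, 4], [5, 6]]             # original input partitions
transform = lambda p: [x * 10 for x in p]     # recorded lineage step

computed = [transform(p) for p in source]
computed[1] = None                            # simulate losing partition 1

# Recovery: reapply the lineage step to the lost partition only.
recovered = [c if c is not None else transform(source[i])
             for i, c in enumerate(computed)]
print(recovered)  # [[10, 20], [30, 40], [50, 60]]
```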
RDD - Resilient Distributed Dataset
Definition
● Dataset - an RDD can represent any form of data, be it read from a CSV file, loaded from a table using an RDBMS, or parsed from text, JSON or XML.
RDD - Resilient Distributed Dataset
Workflow
● Create RDDs from:
○ An existing collection in your driver program.
○ A dataset in an external storage system.
● Two types of operations:
○ Transformations: create a new RDD.
○ Actions: applied on an RDD to instruct Spark to apply computation and pass the result back to the driver.
RDD - Resilient Distributed Dataset
Partition and Parallelism
● Partition - a logical chunk of a large distributed data set. By default, Spark tries to read data into an RDD from the nodes that are close to it.
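Partitioning and parallelism in miniature (plain Python, not Spark): `partition` is a hypothetical helper that splits the data into logical chunks, which a thread pool then processes in parallel, one task per partition.

```python
# Partitioning sketch: split a dataset into logical chunks, process
# the chunks in parallel (plain Python, not the Spark API).
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split `data` into `n` roughly equal chunks."""
    size = -(-len(data) // n)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(1, 9))
parts = partition(data, 4)                     # [[1, 2], [3, 4], [5, 6], [7, 8]]
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, parts))  # one task per partition
print(partial_sums, sum(partial_sums))  # [3, 7, 11, 15] 36
```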
RDD - Resilient Distributed Dataset
Partition and Parallelism
(figure slide)
Spark Program
Architecture
(four figure slides)
Spark Program
Directed Acyclic Graph (DAG) Visualization
(figure slide)
Q&A
Thank you for following along.
We hope that together we will reach success.
