Slide 7: Spark Introduction

Big Data (Spark)
Instructor: Trong-Hop Do
November 24th, 2020
S3Lab
Smart Software System Laboratory
“Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Introduction
● Apache Spark is an open-source cluster computing framework for real-time processing.
● Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
● Spark is designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming.
Applications
(figure slide)
Components
● Spark Core and Resilient Distributed Datasets (RDDs)
● Spark SQL
● Spark Streaming
● Machine Learning Library (MLlib)
● GraphX
Spark Core
● The base engine for large-scale parallel and distributed data processing. Spark Core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core enable diverse workloads such as streaming, SQL, and machine learning. It is responsible for:
○ Memory management and fault recovery
○ Scheduling, distributing, and monitoring jobs on a cluster
○ Interacting with storage systems
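A minimal PySpark sketch of Spark Core's RDD API, a word count. It assumes a local Spark installation; the master URL, app name, and input file input.txt are illustrative placeholders, not part of the slides.

```python
# Minimal sketch of Spark Core's RDD API: a word count.
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")        # local mode, for illustration
lines = sc.textFile("input.txt")                  # hypothetical input file
counts = (lines.flatMap(lambda l: l.split())      # split lines into words
               .map(lambda w: (w, 1))             # pair each word with a count
               .reduceByKey(lambda a, b: a + b))  # sum counts per word
print(counts.take(5))                             # action: triggers execution
sc.stop()
```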
Spark Streaming
● Used to process real-time streaming data.
● It enables high-throughput, fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is essentially a series of RDDs (Resilient Distributed Datasets) processed over time.
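A hedged sketch of a DStream pipeline. The socket source on localhost:9999 (e.g., fed with `nc -lk 9999`) and the 1-second batch interval are assumptions for illustration only.

```python
# Sketch: word counts over 1-second micro-batches from a socket stream.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, batchDuration=1)       # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # hypothetical live source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()
```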
Spark Streaming - Workflow
(figure slide)
Spark Streaming - Fundamentals
● Streaming Context
● DStream (Discretized Stream)
● Caching
● Accumulators, Broadcast Variables, and Checkpoints
Spark SQL
● Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. The four libraries of Spark SQL are:
○ Data Source API
○ DataFrame API
○ Interpreter & Optimizer
○ SQL Service
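A hedged sketch of weaving SQL queries with code transformations, as described above. The file people.json and its columns (name, age) are illustrative assumptions.

```python
# Sketch: mixing a SQL query with DataFrame transformations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()
df = spark.read.json("people.json")       # hypothetical input file
df.createOrReplaceTempView("people")      # expose the data to SQL

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.filter(adults.age < 65).show()     # continue with DataFrame ops
spark.stop()
```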
Spark SQL - Data Source API
● Universal API for loading and storing structured data.
○ Built-in support for Hive, JSON, Avro, JDBC, Parquet, etc.
○ Supports third-party integration through Spark packages
○ Support for smart sources
○ Data abstraction and a Domain Specific Language (DSL) applicable to structured and semi-structured data
○ Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.)
○ Can be easily integrated with all Big Data tools and frameworks via Spark Core
○ Scales from kilobytes of data on a single-node cluster to petabytes on multi-node clusters
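A sketch of the uniform load/store interface the Data Source API provides. The file paths, the MySQL URL, and the credentials are hypothetical, and a real JDBC read would also require the driver jar on the classpath.

```python
# Sketch: one read/write interface across formats and storage systems.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourceDemo").getOrCreate()

# Read CSV, write Parquet: same API, different formats.
df = spark.read.format("csv").option("header", "true").load("events.csv")
df.write.format("parquet").mode("overwrite").save("events.parquet")

# JDBC source (hypothetical database and credentials).
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://localhost:3306/db")
           .option("dbtable", "users")
           .option("user", "reader").option("password", "secret")
           .load())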
Spark SQL - DataFrame API
● A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a relational table in SQL.
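A small sketch of building a DataFrame from an in-memory collection; the rows and column names are illustrative.

```python
# Sketch: a DataFrame has named columns, like a relational table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)],
                           schema=["name", "age"])
df.printSchema()                 # named, typed columns
df.filter(df.age > 40).show()    # relational-style filtering
```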
Spark SQL - SQL Interpreter and Optimizer
● The optimizer (Catalyst) is based on functional programming constructs in Scala.
○ It provides a general framework for transforming trees, used to perform analysis/evaluation, optimization, planning, and runtime code generation.
○ It supports cost-based optimization (runtime and resource utilization are termed the cost) and rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.
Spark SQL - SQL Service
● The SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
Spark SQL - Features
● Integration With Spark
● Uniform Data Access
● Hive Compatibility
● Standard Connectivity
● Performance And Scalability
● User Defined Functions (see the sketch below)
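A hedged sketch of the last feature, a user-defined function (UDF) in Spark SQL; the function, column names, and sample rows are illustrative.

```python
# Sketch: registering and applying a UDF on a DataFrame column.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDFDemo").getOrCreate()
name_len = udf(lambda s: len(s), IntegerType())   # custom column function

df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])
df.withColumn("name_length", name_len(df.name)).show()
```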
GraphX
● GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD with a Resilient Distributed Property Graph: a directed multigraph that can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it, and the parallel edges allow multiple relationships between the same vertices.
● GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API.
● In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
● GraphX unifies the ETL (Extract, Transform & Load) process, exploratory analysis, and iterative graph computation within a single system.
GraphX - Use Cases
● Disaster detection systems
● PageRank
● Financial fraud detection
● Business analysis
○ Machine learning, understanding customer purchase trends
● Geographic information systems
○ Watershed delineation and weather prediction
● Google Pregel
GraphX - Graph and Examples
(figure slide)

GraphX - Flight Data Analysis using Spark GraphX
(figure slides)
MLlib
● MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
MLlib - Algorithms
● Basic Statistics: summary statistics, correlation, stratified sampling, hypothesis testing, random data generation
● Regression
● Classification (see the sketch after this list)
● Recommendation Systems: collaborative, content-based
● Clustering
● Dimensionality Reduction: feature selection, feature extraction
● Feature Extraction
● Optimization
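A hedged sketch of one item from the list above, classification, using the DataFrame-based MLlib API (spark.ml); the toy training rows and the default column names (features, label) are illustrative.

```python
# Sketch: training and applying a logistic regression classifier.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()
train = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1]), 0.0),    # (feature vector, label)
    (Vectors.dense([2.0, 1.0]), 1.0),
    (Vectors.dense([2.2, 1.4]), 1.0),
], ["features", "label"])

model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("features", "prediction").show()
```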
MLlib - Use Cases
● Earthquake detection system (figure slide)
● Movie recommendation system (figure slides)
Challenges of Distributed Computing
● How to divide the input data
● How to assign the divided data to machines in the cluster
● How to check and monitor whether a machine in the cluster is alive and has the resources to perform its duty
● How to retry or reassign failed chunks to another machine or worker
● If the computation involves an aggregation operation such as a sum, how to collate results from many workers and compute the aggregate
● Efficient use of memory, CPU, and network
● Monitoring the tasks
● Overall job coordination
● Keeping a global time
Use Cases
● ETL
● Analytics
● Machine learning
● Graph processing
● SQL queries on large data sets
● Batch processing
● Stream processing
Features
● Multiple language support: Spark provides high-level APIs in Java, Scala, Python, and R, so applications can be written in any of these four languages. It also provides interactive shells for Scala and Python: the Scala shell is accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark in the installation directory.
● Fast data processing: up to 10 times faster on disk and 100 times faster in memory compared to Hadoop. Spark achieves this speed through controlled partitioning: it manages data using partitions that help parallelize distributed data processing with minimal network traffic.
● Through the RDD abstraction, Spark provides fault tolerance with guaranteed zero data loss.
● Increased system efficiency due to lazy evaluation of RDD transformations: Apache Spark delays evaluation until it is absolutely necessary, which is one of the key factors contributing to its speed. Spark adds transformations to a DAG (Directed Acyclic Graph) of computation, and only when the driver requests some data does this DAG actually get executed.
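A small sketch of lazy evaluation as just described: the transformations only build the DAG, and nothing runs until an action is called. The data and app name are illustrative.

```python
# Sketch: transformations are lazy; the action triggers the DAG.
from pyspark import SparkContext

sc = SparkContext("local[*]", "LazyDemo")
rdd = sc.parallelize(range(1_000_000))
squares = rdd.map(lambda x: x * x)             # transformation: nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)   # still just building the DAG
print(evens.count())                           # action: the DAG executes now
sc.stop()
```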
● In-memory processing, resulting in high computation speed and an acyclic data flow
● With 80+ high-level operators, it is easy to develop dynamic and parallel applications
● Real-time data stream processing with Spark Streaming
● Flexible enough to run independently or integrated with the Hadoop YARN cluster manager. Apache Spark provides smooth compatibility with Hadoop, which is a boon for Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for Hadoop's MapReduce functions, and it can run on top of an existing Hadoop cluster using YARN for resource scheduling.
● Cost-efficient for Big Data, with minimal need for storage and data-center capacity
● Forward-looking analytics with built-in tools for machine learning (MLlib), interactive queries, and data streaming
● Persistent and immutable in nature, with data-parallel processing over the cluster
● GraphX simplifies graph analytics by collecting algorithms and builders
● Code reuse across batch processing and ad-hoc queries
● A progressive and expanding Apache community, active for quick assistance
● Spark supports multiple data sources such as Parquet, JSON, Hive, and Cassandra, apart from the usual formats such as text files, CSV, and RDBMS tables. The Data Source API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
Spark Built on Hadoop
(figure slide)
RDD - Resilient Distributed Dataset
Definition
● The Resilient Distributed Dataset is the fundamental data structure abstraction of Spark.
● An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. For instance, you can create an RDD of integers; it gets partitioned, and the partitions are assigned to various nodes in the cluster for parallel processing.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
● Resilient - fault-tolerant: able to detect and recompute missing or damaged partitions of an RDD caused by node or network failures.
● Distributed - data is partitioned and resides on multiple nodes, depending on the cluster size, type, and configuration.
● In-memory data structure - RDDs are mostly kept in memory, so iterative operations run faster and perform far better than traditional Hadoop programs at executing iterative algorithms.
● Dataset - it can represent any form of data, whether read from a CSV file, loaded from an RDBMS table, or taken from a text, JSON, or XML file.
Workflow
● Create RDDs from either:
○ An existing collection in your driver program, or
○ A reference to a dataset in an external storage system.
● Two types of operations (see the sketch below):
○ Transformations: create a new RDD from an existing one.
○ Actions: applied to an RDD to instruct Spark to perform the computation and pass the result back to the driver.
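A hedged sketch of this workflow: both ways of creating RDDs, then one transformation and one action. The file data.txt is a hypothetical external dataset.

```python
# Sketch: creating RDDs, then transformations vs. actions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDWorkflow")

# Create: from a driver-program collection, or from external storage.
nums = sc.parallelize([1, 2, 3, 4, 5])
lines = sc.textFile("data.txt")           # hypothetical external file

# Transformation: lazily builds a new RDD.
doubled = nums.map(lambda x: x * 2)

# Action: triggers computation, returns the result to the driver.
print(doubled.collect())                  # [2, 4, 6, 8, 10]
sc.stop()
```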
Partition and Parallelism
● A partition is a logical chunk of a large distributed data set. By default, Spark tries to read data into an RDD from the nodes that are close to it.
(figure slide)
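A small sketch of inspecting and controlling partitioning; the partition count of 8 and the local 4-core master are arbitrary choices for illustration.

```python
# Sketch: requesting a partition count and inspecting the result.
from pyspark import SparkContext

sc = SparkContext("local[4]", "PartitionDemo")
rdd = sc.parallelize(range(100), numSlices=8)   # request 8 partitions
print(rdd.getNumPartitions())                   # 8
print(rdd.glom().map(len).collect())            # element count per partition
sc.stop()
```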
Spark Program
Architecture
(figure slides)

Directed Acyclic Graph (DAG) Visualization
(figure slide)
Q&A
Thank you for your attention.
We hope to reach success together.