HCIA-Big Data V2.0 Training Material
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 1
Big Data Industry
and Technological
Trends
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution
Big Data as a National Strategy for All Countries
• G8: The Group of Eight (G8) released the G8 Open Data Charter and proposed to accelerate the implementation of data openness and usability.
• EU: The European Union promotes the Data Value Chain to transform the traditional governance model, reduce common department costs, and accelerate economic and employment growth with big data.
• Japan: The Abe Cabinet announced the Declaration to Be the World's Most Advanced IT Nation, which plans to develop Japan's national IT strategy with open public data and big data at its core.
• UK: The UK Government released the Capacity Development Strategy, which aims to use data to generate business value and boost economic growth, and undertakes to open the core databases in the transportation, weather, and medical treatment fields.
Implementing the National Big Data Strategy
Big Data Era
Source of Big Data
• There are more than 200 million messages every day.
• There are 2.8 billion smartphone users worldwide.
All Businesses Are Data Businesses
Differences Between Data Processing in the Big Data Era and Traditional Data Processing
• Relationship between modes and data:
  - Traditional: modes come ahead of data (ponds come ahead of fishes).
  - Big data era: data comes ahead of modes; modes evolve constantly as data increases.
Big Data Era
China's netizens rank first in the world, and the data volume they generate each day also surpasses that of any other country.
• Taobao website: more than 50,000 GB of data generated per day; 40 million GB of data stored.
• Baidu: 1 billion GB of data in total; 1 trillion web pages stored; about 6 billion search requests processed each day.
• A camera working at a rate of 8 Mbit/s: 3.6 GB of data generated per hour; tens of millions of GB generated each month in one city.
• Hospitals: the CT image data generated for one patient reaches dozens of GB; tens of billions of GB of data need to be stored each year in a country.
Relationship Between Big Data and People
What Big Data Cannot Do
• Substitute managers' decision-making capabilities: big data is not only a technical issue, but also a decision-making issue. Data onboarding must be pushed and determined by top leaders.
Big Data Era Leading the Future
• A clear vision of the future should guide the efforts we make now; making the right efforts today secures future success.
Big Data Application Scope
Big Data Application: Politics
• Donald Trump employed Cambridge Analytica Ltd (CA) to analyze the personalities and needs of American voters; in its data analysis center ("the cave"), CA acquired personality profiles of 220 million Americans.
• CA used voters' "likes" on Facebook to analyze their personality traits and political orientation, classified voters into three types (Republican supporters, Democratic supporters, and swing voters), and focused on attracting swing voters.
• Trump had never sent emails before; he bought his first smartphone after the presidential election and was fascinated with Twitter. The messages he sent on Twitter were data-driven and varied for different voters.
• African American users were shown a video in which Hillary Clinton called black youths "predators", driving them away from voting for her. These "dark posts" were visible only to specified users.
Big Data Application: Finance
Merchandise
Efficiency
New financial customers
Offer standard industrial services. institutions Scenario-focused
Focus on processes and procedures.
Serve
Passively receive information from a
single source. Contact customers by customers
Flexible personalized
services
customer managers. Interact with
each other in fixed channels and in
inflexible ways.
Traditional
finance
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 22
Big Data Application: Finance Case
• There is a four-hour time difference between the east and west coasts of the USA.
• Walmart uses the sales analysis results of the east coast to guide the goods arrangement of the west coast on the same day.
Big Data Application: Education
Big data analysis has been applied to American public education and has become an important force of education reform. Indicators analyzed include:
• Academic performance
• Dropout rate
• Rate of admission into higher schools
• Literacy accuracy
• Homework correctness
• Hand-raising times in class
• Question-answering times
• Duration and correctness of answering questions
• Sequence of question-answering in exams
Big Data Application: Transportation
Most people may choose railway for a distance less than 500 km, but...
• Example mode of transportation for a distance less than 500 km: Beijing-
Taiyuan.
• Mode of transportation during the Price
Chinese Spring Festival in 2018. Time
Shanghai
Chengdu
500 km
500 km
Guangzhou
500 km Plane Train Vehicle rental Long-distance
coach
• For a 500 km or 6-hours driving distance, railway has the highest performance-price ratio, but the chance of
buying tickets depends upon luck. The performance-price ratio of vehicle rental is inferior to entraining.
According to a survey, in the event of failing in scrambling for train tickets, more than 70% of people will rent a
vehicle to go home.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 25
Big Data Application: Tourism
Big Data Application: Tourism
Honolulu
Colombo
Bali
Okinawa
Jeju
Phuket
Jakarta
Manila
Koh Samui
Kuala Lumpur
Big Data Application: Government and Public Security
Public security scenario: automatic warning and response.
• When the area-based people flow exceeds the threshold (2000 people), the city or community monitoring system reports to upper-level departments, and transaction processing departments confirm and handle the event.
Big Data Application: Traffic Planning
Traffic planning scenario: multi-dimensional analysis of the traffic crowd.
• Areas where the people flow once exceeded the threshold: north gate of Beijing Workers' Gymnasium (more than 500 people per hour), Sanlitun (more than 800 people per hour), and Beijing Workers' Gymnasium (more than 1500 people per hour).
• Age dimensions: younger than 20, 20-30, 30-40, and older than 50.
• Transportation dimensions: bus, metro, auto, and others.
• Outputs: crowd-based traffic forecast, road network planning, and bus line planning.
Big Data Applications: Sports
Challenges of Traditional Data Processing Technologies
• Scalability: there is a gap between the scalability required for big data processing and hardware performance; scaling up single machines cannot keep pace, so systems must scale out.
• A typical legacy stack: portal applications (Appframe, Spring), application middleware (WebLogic 8.2 and Apache Tomcat 5.2), databases (DB2, Oracle, Sybase, TD), and minicomputer resources (P595, P570, and P690) with dedicated storage and network.
• Limitations compared with industry progress: insufficient batch data processing performance; lack of streaming data processing; limited scalability; single data source; external value-added data assets left unexploited.
Application Scenarios of the Enterprise Big Data Platform
• Scenarios: operation, management, supervision, and profession-specific applications.
• With strong appeals for data analysis from telecom carriers, financial institutions, and governments, new technologies originating on the Internet have been adopted to process big data of low value density.
Challenges Faced by Enterprises (1)
• Many enterprises' business departments are not familiar with big data as well as its application scenarios and benefits. Therefore, it
is difficult for them to provide accurate requirements on big data. The requirements of business departments are not clear. The big
data departments are non-profit departments. Therefore, enterprises' decision-makers worry about low input-output ratio and
hesitate to construct a big data department. Moreover, they even delete lots of valuable historical data because there is no
application scenario.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 34
Challenges Faced by Enterprises (2)
Challenges Faced by Enterprises (3)
Challenge 3: Low data availability and poor quality. The problem locating
time is decreased
by 50%.
• Many large and medium enterprises generate a large amount of data each day. However, some
Manual checks are
decreased due to
self-service on
enterprises pay no attention to big data preprocessing, resulting in nonstandard data processing.
problem detection.
be processed, cleaned, and denoised to obtain valid data. According to data from Sybase, if high- Manual participation
is not required due to
quality data availability improves by 10%, enterprise revenue will improve more than 10%. proactive problem
detection.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 36
Challenges Faced by Enterprises (4)
Challenges Faced by Enterprises (5)
• Network-based lives let criminals obtain personal information easily, and also lead to more crime methods
that are difficult to be tracked and prevented.
• How to ensure personal information security becomes an important subject in the big data era. In addition,
with the continuous increase of big data, requirements on the security of physical devices for storing data as
well as on the multi-copy and disaster recovery mechanism of data will become higher and higher.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 38
Challenges Faced by Enterprises (6)
• Each step of big data construction must be completed by professionals, so it is necessary to build a professional team that understands big data, knows administration, and has experience in big data applications. Hundreds of thousands of big-data-related jobs are added globally every year, and a talent gap of more than 1 million will appear in the future. Universities and enterprises should therefore make joint efforts to cultivate talent.
Challenges Faced by Enterprises (7)
From Batch Processing to Real-Time Analysis
• Hadoop is a basis for batch processing of big data, but Hadoop cannot provide real-time analysis.
• The Hadoop ecosystem builds on HDFS (Hadoop Distributed File System) and YARN / MapReduce v2, with ZooKeeper for coordination, HBase as a columnar store, Hive for SQL query, Pig for scripting, Mahout for machine learning, R connectors for statistics, Oozie for workflow, Sqoop for data exchange, and a log collector (Flume).
Hadoop Reference Practice in the Industry
• Intel Distribution for Apache Hadoop software, managed by Intel Manager for Apache Hadoop (deployment, configuration, monitoring, alerts, and security).
• Components: HDFS 2.0.3 and YARN (MRv2) as the distributed processing framework; ZooKeeper 3.4.5 (coordination); HBase 0.94.1 (columnar store); Hive 0.9.0 (SQL query); Pig 0.9.2 (scripting); Mahout 0.7 (machine learning); R connectors (statistics); Oozie 3.3.0 (workflow); Sqoop 1.4.1 (data exchange); Flume 1.3.0 (log collector).
• The distribution mixes Intel proprietary components, Intel enhancements contributed back to open source, and open source components included without change. All external names and brands are claimed as the property of others.
In-Memory Computing Reference Practice in the
Industry
Google PowerDrill
Stream Computing Reference Practice in the Industry
• IBM InfoSphere Streams is one of the core components of IBM's big data strategy. It supports high-speed processing of structured and unstructured data, processing of data in motion, throughput of millions of events per second, high scalability, and the Streams Processing Language (SPL).
• HStreaming reconstructed the Hadoop MapReduce framework for streaming. The reconstructed framework is compatible with existing mainstream Hadoop infrastructures and processes data in streaming MapReduce mode with no or only tiny changes to the framework. Gartner rated HStreaming as the coolest ESP vendor. The reconstructed framework supports text and video processing using the Apache Pig language (Pig Latin) and provides the high scalability of Hadoop, throughput of millions of events per second, and millisecond-level delay.
Opportunities in the Big Data Era
Talents Required During the Development of Big Data
Huawei Big Data Platform Architecture
• Layers (top to bottom): application service layer; Hadoop API and plugin API; Hive, MapReduce, Spark, Storm, and Flink; Hadoop YARN / ZooKeeper and LibrA; HDFS / HBase. System management, service governance, and security management run alongside all layers.
• The Hadoop layer provides a real-time data processing environment, enhanced based on community open source software.
• The DataFarm layer provides end-to-end data insight and builds the data supply chain from data to information, knowledge, and wisdom, including Porter for data integration services, Miner for data mining services, and Farmer for data service frameworks.
• Manager is a distributed management architecture. The administrator can control distributed clusters from a single access point, covering system management, data security management, and data governance.
Core Capabilities of Huawei Big Data Team
Capability ladder, from basic to advanced:
• Be able to locate peripheral problems (across a large number of components and a large code base).
• Be able to resolve kernel-level problems (outstanding individuals).
• Be able to resolve kernel-level problems as a team.
• Be able to independently complete kernel-level development for critical service features.
• Be able to develop future-oriented kernel features.
• Be able to take the lead in the communities and establish top-level projects that are adaptable to the community ecosystem.
• Outstanding product development and delivery capabilities and carrier-class operation support capabilities, empowered by the Hadoop kernel team.
Big Data Platform Partners from the Finance and Carrier Sectors
• Partners include Industrial and Commercial Bank of China (ICBC), China Merchants Bank (CMB), Pacific Insurance Co., Ltd. (CPIC), China Mobile, and China Unicom.
• Customers cover the top 3 telecom carriers in China and 50% of the top 10 customers in China's financial industry.
Summary
These slides introduce:
• The big data era.
• Applications of big data in all walks of life.
• Opportunities and challenges brought by big data.
• Huawei big data solution.
Quiz
1. Where is big data from? What are the features of big data?
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
THANK YOU!
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features
Dictionary vs. File System
HDFS Overview
The Hadoop Distributed File System (HDFS) is developed based on the Google File System (GFS) and runs on commodity hardware.
In addition to the features provided by other distributed file systems, HDFS also provides the following features:
• High fault tolerance: resolves hardware unreliability problems.
HDFS Application Scenarios
Position of HDFS in FusionInsight
• HDFS sits at the bottom of the FusionInsight stack, together with HBase, beneath Hadoop YARN / ZooKeeper and LibrA. Hive, M/R, Spark, Storm, and Flink run above it and are exposed through the Hadoop API and plugin API to the application service layer, with system management, service governance, and security management alongside.
Basic System Architecture
HDFS architecture:
• The NameNode stores metadata (name, replicas, ..., for example /home/foo/data, 3, ...) and serves metadata operations and block operations.
• Clients issue metadata operations to the NameNode and read/write blocks directly on DataNodes.
• DataNodes store blocks and replicate them across racks (Rack 1, Rack 2).
HDFS Data Write Process
• The HDFS client calls create on the DistributedFileSystem (step 1), which asks the NameNode to create the file (step 2).
• The client writes data (step 3); the data is pipelined through the chosen DataNodes (step 4), and acknowledgements flow back along the pipeline (step 5).
• When the write finishes, the client notifies the NameNode that the file is complete (step 7).
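The write path above can be sketched as a toy simulation. This is a minimal sketch with invented names (`namenode_allocate`, `write_file`, the `dn*` node names, and the tiny block size) and a round-robin stand-in for the real placement policy; it is not real HDFS API code.

```python
# Toy simulation of the HDFS write pipeline: the client splits the file into
# blocks, the NameNode assigns DataNodes per block, and data flows down the
# pipeline to every replica. Real HDFS blocks default to 128 MB.
import itertools

BLOCK_SIZE = 4            # bytes, tiny on purpose for the demo
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]
_picker = itertools.cycle(DATANODES)   # stand-in for NameNode placement logic

def namenode_allocate():
    """NameNode chooses REPLICATION DataNodes for one block (step 2)."""
    return [next(_picker) for _ in range(REPLICATION)]

def write_file(data, storage):
    """Client splits data into blocks and pipelines each block (steps 3-5)."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    block_map = {}
    for block_id, block in enumerate(blocks):
        pipeline = namenode_allocate()
        for dn in pipeline:            # data flows from DataNode to DataNode
            storage.setdefault(dn, {})[block_id] = block
        block_map[block_id] = pipeline
    return block_map                   # client then reports completion (step 7)

storage = {}
block_map = write_file(b"hello hdfs!", storage)
```

Reading any one replica of every block in order reconstructs the original file, which is exactly what the read path (next slide) relies on.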
HDFS Data Read Process
• The client obtains the block locations of a file from the NameNode and then reads the data blocks directly from the DataNodes that hold them (steps 4 and 5: read).
Key Design of HDFS Architecture
• NameNode / DataNode in master/slave mode.
• Federation storage.
HDFS High Availability (HA)
• Two NameNodes run in active/standby mode; a ZKFC (ZooKeeper Failover Controller) beside each NameNode monitors it through heartbeats and handles failover.
• The active NameNode writes the EditLog to a group of JournalNodes (JN); the standby NameNode reads the log to keep its FSImage metadata synchronized.
• HDFS clients send metadata operations to the active NameNode; DataNodes send heartbeats, and data read/write, block operations, and replica copying happen on the DataNodes.
Metadata Persistence
• EditLog: a journal that records namespace operations as they happen.
• FSImage: a checkpoint of the entire namespace; EditLog entries are periodically merged into a new FSImage.
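The checkpointing idea can be shown with a minimal sketch. The dict-and-tuple data model here is invented for illustration; real HDFS uses binary on-disk formats for both files.

```python
# Minimal sketch of EditLog/FSImage checkpointing: replay the journal of
# operations onto a copy of the last namespace snapshot to get a new one.

def apply_edits(fsimage, editlog):
    """Replay EditLog entries onto a copy of the FSImage (a new checkpoint)."""
    namespace = {path: dict(meta) for path, meta in fsimage.items()}
    for op, path, *args in editlog:
        if op == "create":
            namespace[path] = {"replicas": args[0]}
        elif op == "set_replication":
            namespace[path]["replicas"] = args[0]
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/home/foo/data": {"replicas": 3}}        # last checkpoint
editlog = [                                          # operations since then
    ("create", "/tmp/log", 2),
    ("set_replication", "/home/foo/data", 4),
    ("delete", "/tmp/log"),
]
checkpoint = apply_edits(fsimage, editlog)
```

After the merge, the old EditLog can be truncated, which is why checkpointing keeps NameNode restarts fast.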
HDFS Federation
APP Client-1 Client-k Client-n
Common Storage
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 73
Data Replication
Data Center Placement policy
Distance=4
Distance=4
Distance=0
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 74
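A hedged sketch of the default HDFS placement rule: first replica on the writer's node, second on a node in a different rack, third on another node in the second replica's rack. The topology and node names below are invented for the demo.

```python
# Rack-aware replica placement, simplified: no load balancing, no fallbacks.
TOPOLOGY = {"dn1": "rack1", "dn2": "rack1", "dn3": "rack2", "dn4": "rack2"}

def place_replicas(writer):
    first = writer                                       # same node as writer
    second = next(dn for dn, rack in TOPOLOGY.items()    # a different rack
                  if rack != TOPOLOGY[writer])
    third = next(dn for dn, rack in TOPOLOGY.items()     # second replica's rack
                 if rack == TOPOLOGY[second] and dn != second)
    return [first, second, third]

replicas = place_replicas("dn1")
```

Spanning exactly two racks balances write cost (only one cross-rack transfer) against rack-failure tolerance.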
Configuring HDFS Data Storage Policies - Layered Storage
Configuring DataNodes with layered storage:
• The HDFS layered storage architecture provides four types of storage devices: RAM_DISK (memory-virtualized disk), DISK (mechanical hard disk), SSD (solid-state disk), and ARCHIVE (high-density, low-cost storage media).
• Storage policies for different scenarios are formulated by combining the four types of storage devices.
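The combinations the slide mentions can be written out concretely. The policy names below follow the standard Apache HDFS storage policies (FusionInsight specifics may differ), expressed as the storage type used for each replica in order, with fallback rules omitted.

```python
# Standard Apache HDFS storage policies as replica-to-storage-type lists.
POLICIES = {
    "LAZY_PERSIST": ["RAM_DISK", "DISK", "DISK"],
    "ALL_SSD":      ["SSD", "SSD", "SSD"],
    "ONE_SSD":      ["SSD", "DISK", "DISK"],
    "HOT":          ["DISK", "DISK", "DISK"],
    "WARM":         ["DISK", "ARCHIVE", "ARCHIVE"],
    "COLD":         ["ARCHIVE", "ARCHIVE", "ARCHIVE"],
}

def media_for(policy, replication=3):
    """Storage type per replica, repeating the last type for extra replicas."""
    types = POLICIES[policy]
    return (types + [types[-1]] * replication)[:replication]
```

For example, `media_for("ONE_SSD")` keeps one fast replica on SSD while the remaining replicas go to cheaper mechanical disks.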
Configuring HDFS Data Storage Policies - Tag Storage
• The NameNode maps directories to tags, and tags select the DataNodes on which data is stored (for example, /HBase → T1; /Hive → T1, T3; /Spark → T2; /Flume → T3).
Configuring HDFS Data Storage Policies - Node Group Storage
• Replicas are distributed across configured rack groups; for example, Rackgroup1 is mandatory, while Rackgroup2, Rackgroup3, and Rackgroup4 hold the remaining replicas.
Colocation
T he definition of Colocation: is to store associated data or data that is going to be associated on the
same storage node.
According to the picture below, assume that file A and file D are going to be associated with each other,
which involves massive data migration. Data transmission consumes much bandwidth, which greatly
affects the processing speed of massive data and system performance.
NN
F
Aile A
A
A A
A B A
B C A
B D A
C D A File
A B
C D File
A C
DN1 DN2 DN3 DN4 DN5 DN6 File
A D
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 79
Colocation Benefits
T he HDFS colocation: is to store files that need to be associated with each other on the same data node so
that data does not have to be obtained from other nodes during associated computing. This greatly reduces
network bandwidth consumption.
When joining files A and D with colocation feature, resource consumption can be greatly reduced because the
blocks of multiple associated files are distributed on the same storage node.
NN
F
Aile A
A C A
A B A
B C A
B A
C A D File
A B
D D File
A C
DN1 DN2 DN3 DN4 DN5 DN6 File
A D
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 80
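A toy model makes the benefit concrete: files created with the same "locator" have their blocks placed on the same group of DataNodes, so joining them needs no cross-node transfer. The locator notion mirrors the colocation idea in spirit; every name here (`NODE_GROUPS`, `allocate`, the file and node names) is invented.

```python
# Toy colocation model: a shared locator pins files to one node group.
NODE_GROUPS = {"groupA": ["dn1", "dn2", "dn3"]}
placements = {}

def allocate(filename, locator):
    """Place all of a file's blocks on the locator's node group."""
    placements[filename] = NODE_GROUPS[locator]

def join_is_local(f1, f2):
    """True when both files live on the same nodes (no network shuffle)."""
    return set(placements[f1]) == set(placements[f2])

allocate("fileA", "groupA")
allocate("fileD", "groupA")
```

Because `fileA` and `fileD` share a locator, `join_is_local("fileA", "fileD")` holds and the join reads both inputs from local disks.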
HDFS Data Integrity Assurance
Other Key Design Points of the HDFS Architecture
Space reclamation:
The recycle bin mechanism is provided and the number of replicas can be dynamically set.
Data organization:
Access mode:
Data can be accessed through Java APIs, HTTP, or shell commands.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 82
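HTTP access goes through the WebHDFS REST API, whose URLs take the form `http://<namenode>:<port>/webhdfs/v1/<path>?op=...`. The host below is a placeholder for your NameNode; the NameNode HTTP port is commonly 9870 on Hadoop 3.x (50070 on 2.x).

```python
# Build WebHDFS REST URLs for HTTP access to HDFS.
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """WebHDFS URL for an operation such as OPEN, LISTSTATUS, or MKDIRS."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

url = webhdfs_url("namenode.example.com", 9870, "/home/foo/data", "OPEN")
```

Any HTTP client can then GET such a URL; no Hadoop libraries are required on the caller's side.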
Common Shell Commands
Summary
This module describes the following information about HDFS: basic concepts,
application scenarios, technical architecture and its key features.
Quiz
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
THANK YOU!
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features
MapReduce Overview
M apReduce is developed based on the paper issued by Google about MapReduce and is used for parallel
computing of a massive data set (larger than 1 TB) . It delivers the following highlights:
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 92
YARN Overview
A pache Hadoop YARN (Yet Another Resource Negotiator) is a new Hadoop resource manager. It
provides unified resource management and scheduling for upper-layer applications, remarkably
improving cluster resource utilization, unified resource management, and data sharing.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 93
Position of YARN in FusionInsight
• YARN sits beneath the application service layer and the Hadoop API / plugin API; Hive, M/R, Spark, Streaming, and Flink run on top of it, with system management and service governance alongside.
Working Process of MapReduce (1)
• Before starting MapReduce, make sure that the files to be processed are stored in HDFS.
• The client submits the job to ResourceManager, which creates the job. One application maps to one job (example job ID: job_201431281420_0001). The job resources (job.jar, job.split, and job.xml) are committed along with it.
• Before the job is submitted, the files to be processed are split. By default, the MapReduce framework regards one block as one split; client applications can redefine the mapping between blocks and splits.
• After the job is submitted, ResourceManager selects an appropriate NodeManager in the cluster, based on NodeManager workloads, to start the ApplicationMaster. The ApplicationMaster initializes the job and applies to ResourceManager for resources.
• ResourceManager selects appropriate NodeManagers to start the containers in which the Map tasks execute.
• The outputs of Map are placed in an in-memory buffer. When the buffer overflows, its data is spilled to local disks; before the spill, partitioning must be completed. By default, a hash algorithm is used for partitioning, the MapReduce framework determines the number of partitions from the number of Reduce tasks, and records with the same key are sent to the same Reduce task.
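The default partitioning rule described above sends a record to partition `hash(key) mod numReduceTasks`. Python's built-in `hash()` is salted per process, so this sketch uses `crc32` as a stable stand-in for Java's `key.hashCode()`; it illustrates the rule, not the exact Hadoop hash values.

```python
# Hash partitioning: every record with the same key lands in the same
# partition, hence reaches the same Reduce task.
from zlib import crc32

def partition(key, num_reduces):
    return crc32(key.encode("utf-8")) % num_reduces

keys = ["Hello", "World", "Bye", "Hadoop"]
parts = {k: partition(k, 3) for k in keys}
```

Because the function is deterministic, repeated occurrences of a key across all Map tasks always map to one partition.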
Working Process of MapReduce (2)
Shuffle Mechanism
Shuffle is the data transfer process between the Map phase and the Reduce phase. It involves Reduce tasks fetching MOF files from Map tasks, then sorting and merging the MOF files.
Example: Typical Program WordCount
Functions of WordCount
• Input:
  Hello World Bye World
  Hello Hadoop Bye Hadoop
  Bye Hadoop Hello Hadoop
• Output (after MapReduce):
  Bye 3
  Hadoop 4
  Hello 3
  World 2
Map Process of WordCount
<Hello,1>
<Hadoop,1>
02 “Hello Hadoop Bye Hadoop” Map <Bye,1>
<Hadoop,1>
<Bye,1>
03 “Bye Hadoop Hello Hadoop” <Hadoop,1>
Map
<Hello,1>
<Hadoop,1>
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 102
Reduce Process of WordCount
<Hello,1> <Hello,1>
<World,1> <World,2>
<Hello,1 1 1> Reduce Bye 3
<Bye,1> <Bye,1>
<World,1>
<Bye,1 1 1> Reduce Hadoop 4
<Hello,1> <Hello,1>
<Hadoop,1> Combine Shuffle
<Hadoop,2>
<Bye,1> <Bye,1>
<Hadoop,1> <World,2> Reduce Hello 3
<Bye,1>
<Bye,1>
<Hadoop,1>
<Hadoop,2>
<Hello,1> <Hadoop,2 2> Reduce World 2
<Hello,1>
<Hadoop,1>
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 103
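The whole WordCount flow can be simulated end to end on the slides' input: map emits `<word, 1>`, combine aggregates locally per map, shuffle groups values by key, and reduce sums them. This is a pure-Python illustration, not Hadoop API code.

```python
# End-to-end WordCount simulation: map -> combine -> shuffle -> reduce.
from collections import defaultdict

lines = ["Hello World Bye World",
         "Hello Hadoop Bye Hadoop",
         "Bye Hadoop Hello Hadoop"]

def map_fn(line):
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Local aggregation of one map task's output."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

def shuffle(map_outputs):
    """Group values by key across all map outputs."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    return key, sum(values)

map_outputs = [combine(map_fn(line)) for line in lines]
result = dict(reduce_fn(k, vs) for k, vs in shuffle(map_outputs).items())
```

The result matches the slide's output exactly: Bye 3, Hadoop 4, Hello 3, World 2.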
Architecture of YARN
• Clients submit jobs to the ResourceManager (job submission).
• The ResourceManager starts an App Mstr (ApplicationMaster) in a container on a NodeManager; the ApplicationMaster sends resource requests to the ResourceManager and reports MapReduce status.
• NodeManagers report node status to the ResourceManager and host the containers in which tasks run.
Task Scheduling of MapReduce on YARN
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 105
YARN HA Solution
ResourceManager in YARN manages resources and schedules tasks in the cluster. The YARN HA solution uses redundant ResourceManager nodes to solve the single point of failure problem of ResourceManager.
ZooKeeper Cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 106
YARN AppMaster Fault Tolerance Mechanism
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 107
CONTENTS
01 02 03 04
Introduction to Functions and Resource Enhanced Features
MapReduce and Architectures of Management and
YARN MapReduce and Task Scheduling of
YARN YARN
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 108
Resource Management
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.vmem-pmem-ratio
yarn.nodemanager.resource.cpu-vcores
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 109
Resource Allocation Model
The Scheduler allocates resources in three steps:
1. Selects a queue: traverses the queue tree from Root through the parent queues.
2. Selects an application from the queue (App1, App2, … App N).
3. Matches the resources requested by the application, in locality order: a specific server (Server A, Server B), then a rack (Rack A, Rack B), and finally any resources.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 110
Capacity Scheduler Overview
Capacity Scheduler enables Hadoop applications to run in a shared, multi-tenant cluster while maximizing the throughput and utilization of the cluster.
Capacity Scheduler allocates resources by queue. Users can set upper and lower limits for the resource usage of each queue. Administrators can restrict the resources used by a queue, user, or job. Job priorities can be set, but resource preemption is not supported.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 111
Highlights of Capacity Scheduler
• Capacity assurance: Administrators can set upper and lower limits for the resource usage of each
queue. All applications submitted to the queue share the resources.
• Flexibility: The remaining resources of a queue can be used by other queues that require resources. If a new
application is submitted to the queue, other queues release and return the resources to the queue.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 112
Task Selection by Capacity Scheduler
• Resources are allocated first to the queue at the smallest depth in the queue hierarchy. For example, between QueueA and QueueB.childQueueB, resources are allocated to QueueA first.
• Resources are allocated to the resource reclamation request queue first.
A task is then selected from the queue based on the following policy:
• The task is selected based on the task priority and submission sequence as well as the limits of user resources and memory.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 113
Queue Resource Limitation (1)
Queues are created on the Tenant page. After a tenant is created and associated with YARN, a queue with the same name as the tenant is created. For example, if tenants QueueA and QueueB are created, two YARN queues, QueueA and QueueB, are created.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 114
Queue Resource Limitation (2)
Queue resource capacity (percentage): suppose there are three queues, default, QueueA, and QueueB, each with a [queue name].capacity configuration:
• The capacity of the default queue is 20% of the total cluster resources.
• The capacity of the QueueA queue is 10% of the total cluster resources.
• The capacity of the QueueB queue is 10% of the total cluster resources.
• The capacity of the root-default shadow queue in the background is 60% of the total cluster resources.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 115
Queue Resource Limitation (3)
Due to resource sharing, the resources used by a queue may exceed its capacity (for example, QueueA.capacity). The maximum resource usage can be limited by the maximum-capacity parameter.
Sharing idle resources: if only a few tasks are running in a queue, the remaining resources of the queue can be shared with other queues. For example, if maximum-capacity of QueueA is set to 100 and tasks are running only in QueueA, QueueA can theoretically use all the cluster resources.
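The arithmetic behind capacity and maximum-capacity can be made concrete with a small sketch. The cluster size and queue percentages below are hypothetical, not values from the slides or an actual YARN API.

```python
# Illustrative Capacity Scheduler arithmetic (hypothetical numbers).
# capacity is the guaranteed share; maximum-capacity caps how far a
# queue may grow into idle cluster resources.
cluster_memory_mb = 100_000

queues = {
    # queue:     (capacity %, maximum-capacity %)
    "default": (20, 100),
    "QueueA":  (10, 100),
    "QueueB":  (10, 50),
}

def guaranteed_mb(queue: str) -> int:
    """Memory guaranteed to the queue even under full cluster load."""
    return cluster_memory_mb * queues[queue][0] // 100

def max_mb(queue: str) -> int:
    """Upper bound the queue can reach when other queues are idle."""
    return cluster_memory_mb * queues[queue][1] // 100

# QueueA is guaranteed 10% of memory but, with maximum-capacity 100 and
# an otherwise idle cluster, may theoretically use all cluster memory.
```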
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 116
User and Task Limitation
Log in to FusionInsight Manager and choose Tenant > Dynamic Resource Plan > Queue Config to configure user and task limitation parameters.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 117
User Limitation (1)
Minimum resource assurance (percentage) of a user:
• The resources for each user in a queue are limited at any time. If tasks of multiple users are running at the same time in a queue, the resource usage of each user fluctuates between a minimum and a maximum value. The maximum value is determined by the number of running tasks, while the minimum value is determined by minimum-user-limit-percent.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 118
User Limitation (2)
user-limit-factor indicates a multiple of the queue capacity and sets the maximum resources that a single user can obtain, with a default of 1: yarn.scheduler.capacity.root.QueueD.user-limit-factor=1 indicates that the resources obtained by a user cannot exceed the queue capacity. No matter how many free resources a cluster has, the resources obtained by a user can never exceed maximum-capacity.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 119
Task Limitation
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 120
Queue Information
Choose Services > YARN > ResourceManager (active) > Scheduler to view queue information.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 121
CONTENTS
01 02 03 04
Introduction to Functions and Resource Enhanced Features
MapReduce and Architectures of Management and
YARN MapReduce and Task Scheduling of
YARN YARN
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 122
Enhanced Features - YARN Dynamic Memory Management
Containers with excessive memory usage cannot run once the NodeManager memory threshold is exceeded. The threshold is calculated as:
NM MEM Threshold = yarn.nodemanager.resource.memory-mb × 1024 × 1024 × yarn.nodemanager.dynamic.memory.usage.threshold
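Evaluating the threshold formula with example settings makes the unit conversion explicit. The two parameter values below are hypothetical, chosen only for illustration.

```python
# The NM MEM Threshold formula from the slide, with hypothetical settings.
# yarn.nodemanager.resource.memory-mb is in MB; *1024*1024 converts to bytes.
yarn_nodemanager_resource_memory_mb = 8192   # hypothetical NM memory (MB)
dynamic_memory_usage_threshold = 0.8         # hypothetical threshold ratio

nm_mem_threshold_bytes = (
    yarn_nodemanager_resource_memory_mb * 1024 * 1024
    * dynamic_memory_usage_threshold
)
# Containers on this NodeManager may exceed their nominal limits only while
# the node's total container memory usage stays below nm_mem_threshold_bytes.
```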
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 123
Enhanced Features - YARN Label-based Scheduling
• Applications that have common resource requirements
• Applications that have demanding memory requirements
• Applications that have demanding I/O requirements
NodeManager
NodeManager NodeManager
Queue
Tasks
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 124
Summary
This module describes the following information: application scenarios and architectures of MapReduce and YARN, resource management and task scheduling of YARN, and enhanced features of YARN in FusionInsight HD.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 125
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 126
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 131
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 135
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 136
Spark Introduction
• Apache Spark is a fast, versatile, and scalable in-memory big data computing engine.
• Apache Spark is a one-stop solution that integrates batch processing, real-time stream processing,
interactive query, graph computing, and machine learning.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 137
Application Scenarios
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 138
Spark Highlights
• Light: Spark core code has 30,000 lines.
• Fast: delay for small datasets reaches the sub-second level.
• Flexible: Spark offers different levels of flexibility.
• Smart: Spark smartly uses existing big data components.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 139
Spark Ecosystem
Spark works with many components in the big data ecosystem: Hadoop, Hive, HBase, Mahout, Docker, Mesos, Elastic, Flume, Kafka, and MySQL.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 140
Spark vs MapReduce (1)
(Diagram: in iterative computing, MapReduce writes the result of each iteration (Iter.1, Iter.2, …) to HDFS and reads it back for the next, while Spark keeps intermediate data in memory. For queries over the same input, MapReduce performs an HDFS read per query, while Spark performs a one-time read and serves subsequent queries from memory.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 141
Spark vs MapReduce (2)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 142
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 143
Spark System Architecture
Structured Spark
Spark SQL MLlib GraphX SparkR
Streaming Streaming
Spark Core
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 144
Spark System Architecture
Structured Spark
Spark SQL MLlib GraphX SparkR
Streaming Streaming
Spark Core
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 145
Core Concepts of Spark - RDD
• Resilient Distributed Datasets (RDDs) are resilient, read-only, and partitioned distributed datasets.
• RDDs are stored in memory by default and are written to disks when the memory is insufficient.
• RDD data is stored in the cluster as partitions.
• RDD has a lineage mechanism (Spark Lineage), which allows for rapid data recovery when data loss occurs.
RDD1 → RDD2 (a map transformation):
"Hello Spark" → "Hello, Spark"
"Hello Hadoop" → "Hello, Hadoop"
"China Mobile" → "China, Mobile"
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 146
RDD Dependencies
• Narrow dependencies (for example, map and filter): each partition of the parent RDD is used by at most one partition of the child RDD.
• Wide dependencies (for example, groupByKey): multiple child partitions depend on the same parent partition, which requires a shuffle.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 147
Stage Division of RDD
(Diagram: RDDs A–G form a DAG that is divided into stages at wide dependencies: Stage1 covers A groupBy B; Stage2 covers C map D and the union with E into F; Stage3 joins B and F into G.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 148
RDD Operators
Transformation
• Transformation operators are invoked to generate a new RDD from one or more existing RDDs.
Such an operator initiates a job only when an Action operator is invoked.
• Typical operators: map, flatMap, filter, and reduceByKey.
Action
• Action operators trigger the actual computation, returning a result to the Driver or writing data to external storage.
• Typical operators: collect, count, reduce, and saveAsTextFile.
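The lazy-Transformation versus eager-Action contract can be sketched in plain Python. `MiniRDD` below is a toy stand-in invented for illustration, not Spark's API: transformations only record operations, and nothing runs until an action is invoked.

```python
# Toy sketch of lazy Transformations vs eager Actions (not Spark's API).
class MiniRDD:
    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []          # deferred transformations

    def map(self, fn):                 # Transformation: record the op, run nothing
        return MiniRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):              # Transformation: also deferred
        return MiniRDD(self._data, self._ops + [("filter", fn)])

    def collect(self):                 # Action: triggers the whole pipeline
        out = self._data
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

rdd = MiniRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has executed yet; only collect() runs the recorded pipeline.
result = rdd.collect()
```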
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 149
Major Roles of Spark (1)
Driver
Responsible for the application business logic and operation planning (DAG).
ApplicationMaster
Manages application resources and applies for resources from ResourceManager as needed.
Client
Submits Spark applications to the cluster on behalf of the user.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 150
Major Roles of Spark (2)
ResourceManager
Manages and allocates resources for the entire cluster.
NodeManager
Manages the resources of a single node and runs containers.
Executor
Runs the tasks of an application and reports status to the Driver.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 151
Spark on YARN - Client Operation Process
Driver
ResourceManager
1. Submit an application
Spark on YARN-Client
YARNClientScheduler ApplicationMaster
Backend
NodeManager
NodeManager
Container
4. Start the container
Executor
ExecutorLauncher
Cache Task
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 152
Spark on YARN - Cluster Operation Process
NodeManager
Spark on YARN-Client Container
4. Driver assigns tasks
1. Submit Executor
an application
2. Allocate resources for the application
NodeManager
Container
Container
Executor
ApplicationMaster
Cache
(including Driver)
DAGScheduler Task
3. Apply for
Executor from Task
ResourceManager YARNClusterScheduler
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 153
Differences Between YARN-Client and YARN-Cluster
Differences: in YARN-Client mode, the Driver runs in the client process and the ApplicationMaster is used only to request resources; in YARN-Cluster mode, the Driver runs inside the ApplicationMaster process on a cluster node, so the client can exit after submitting the application.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 154
Typical Case - WordCount
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 155
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 156
Spark SQL Overview
• Spark SQL is the module used in Spark for structured data processing. In Spark
applications, you can seamlessly use SQL statements or DataFrame APIs to query
structured data.
Cost Model
Unresolved Optimized Selected
DataFrame Logical Plan Physical Plans RDDs
Logical Plan Logical Plan Physical Plan
Catalog
Dataset
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 157
Introduction to Dataset
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 158
Introduction to DataFrame
DataFrame is a dataset organized with named columns. DataFrame is a special case of Dataset[Row].
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 159
RDD, DataFrame, and Datasets (1)
RDD:
• Advantages: type-safe and object-oriented.
• Disadvantages: high performance overhead for serialization and deserialization; high GC overhead due to frequent object creation and deletion.
DataFrame:
• Advantages: schema information reduces serialization and deserialization overhead.
• Disadvantages: not object-oriented; not type-safe at compile time.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 160
RDD, DataFrame, and Datasets (2)
Characteristics of Dataset:
• Fast: in most scenarios, performance is superior to RDD. Encoders outperform Kryo or Java serialization, avoiding unnecessary format conversion.
• Type-safe: similar to RDD, functions are type-checked during compilation as far as possible.
• Interoperable: Dataset, DataFrame, and RDD can be converted to each other.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 161
Spark SQL and Hive
• The execution engine of Spark SQL is Spark Core, and the default
execution engine of Hive is MapReduce.
Differences • The execution speed of Spark SQL is 10 to 100 times faster than Hive.
• Spark SQL does not support buckets, but Hive does.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 162
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 163
Structured Streaming Overview (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 164
Structured Streaming Overview (2)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 165
Programming Model for Structured Streaming
(Diagram: the input stream is modeled as an unbounded table that grows with data up to t=1, t=2, and t=3; the query runs over the whole table at each trigger, and the output is produced in complete mode.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 166
Example Programming Model of Structured
Streaming
Example: words typed into nc arrive in batches — t=1: "cat dog", "dog dog"; t=2: "owl cat"; t=3: "dog", "owl".
Computing results in complete mode:
• t=1: cat 1, dog 3
• t=2: cat 2, dog 3, owl 1
• t=3: cat 2, dog 4, owl 2
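This incremental behavior can be simulated in a few lines of plain Python: each trigger appends new rows to the running state and the query re-emits the full result, as in complete output mode. The batch contents below are made-up sample data.

```python
# Simulating Structured Streaming's unbounded-table model in complete mode
# (hypothetical batch data; not the Spark API).
from collections import Counter

batches = ["cat dog", "dog owl", "owl cat dog"]

counts = Counter()
snapshots = []
for batch in batches:
    counts.update(batch.split())     # new input rows appended to the "table"
    snapshots.append(dict(counts))   # complete mode: emit the whole result

# Each snapshot is the full word count over all data seen so far.
```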
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 167
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 168
Overview of Spark Streaming
Spark Streaming is an extension of the Spark Core API. It is a real-time computing framework featuring scalability, high throughput, and fault tolerance.
HDFS
Kafka
Spark Kafka
HDFS Streaming
Database
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 169
Mini Batch Processing of Spark Streaming
Spark Streaming programming is based on DStream, which decomposes a streaming program into a series of short batch jobs.
(Diagram: an input data stream is divided by Spark Streaming into batches of input data, which the Spark engine processes into batches of processed data.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 170
Fault Tolerance Mechanism of Spark Streaming
Spark Streaming performs computing based on RDDs. If some partitions of an RDD are lost, they can be recovered using the RDD lineage mechanism.
Interval
[0,1)
map reduce
Interval
[1,2)
…
…
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 171
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 172
Spark WebUI
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 173
Spark and Other Components
In the FusionInsight cluster, Spark interacts with the following components:
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 174
Summary
• The background, application scenarios, and characteristics of Spark are briefly introduced.
• Basic concepts, technical architecture, task running processes, Spark on YARN, and application scheduling
of Spark are introduced.
• Integration of Spark in FusionInsight HD is introduced.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 175
Quiz
• What are the differences between wide dependencies and narrow dependencies of
Spark?
• What are the application scenarios of Spark?
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 176
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 177
More Information
• eLearning course:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 178
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 182
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 183
HBase Overview
• HBase is suitable for storing big table data (which contains billions of rows and millions of columns) and allows real-time
data access.
• HBase uses HDFS as the file storage system to provide a distributed column-oriented database system that allows real-
time data reading and writing.
• HBase uses ZooKeeper as the collaboration service.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 184
HBase vs. RDB
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 185
Data Stored By Row
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 186
Data Stored by Column
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 187
HBase vs. RDB
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 188
Application Scenarios of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 189
Position of HBase in FusionInsight
Application service layer
Open API / SDK REST / SNMP / Syslog
System
Hadoop API Plugin API management
Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 190
KeyValue Storage Model (1)
• KeyValue has a specific structure. Key is used to quickly query a data record, and Value is used to store user data.
• As the basic unit of user data storage, a KeyValue must also store some description of itself, such as timestamp and type information, which requires some structured space.
• Data can be expanded dynamically, adapting to changes in data types and structures. Data is read and written in blocks. Different columns are not associated with each other, and neither are different tables.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 191
KeyValue Storage Model (2)
A KeyValue database is partitioned based on continuous Key ranges:
• Data subregions are created based on the RowKey range (sorting based on a sorting algorithm such as the alphabetic order
based on RowKeys). Each subregion is a basic distributed storage unit.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 192
KeyValue Storage Model (3)
• The underlying data of HBase exists in the form of KeyValue, which has a specific format.
• A KeyValue contains key information such as the timestamp and type.
• The same Key can be associated with multiple Values; each KeyValue has a Qualifier.
• Multiple KeyValues can be associated with the same Key and Qualifier. In this case, they are distinguished by timestamp, which is why a data record can have multiple versions.
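A minimal in-memory model makes the versioning rule concrete: the same (row, family, qualifier) coordinate may hold several timestamped values, and a read returns the newest one. This is illustrative only, not the HBase client API.

```python
# Toy model of HBase's multi-version KeyValue layout (not the HBase API).
store = {}  # (row, family, qualifier) -> {timestamp: value}

def put(row, family, qualifier, value, ts):
    """Insert a KeyValue; the same coordinate may hold many versions."""
    store.setdefault((row, family, qualifier), {})[ts] = value

def get_latest(row, family, qualifier):
    """A read returns the version with the highest timestamp."""
    versions = store[(row, family, qualifier)]
    return versions[max(versions)]

put("row1", "info", "city", "Wuhan", ts=100)
put("row1", "info", "city", "Changsha", ts=200)  # newer version, same coordinate
# get_latest now resolves to "Changsha"; the older version still exists.
```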
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 193
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 194
HBase Architecture (1)
HRegionServer HRegionServer
HRegion HRegion
HBase
HLog
StoreFile StoreFile … StoreFile … … … StoreFile StoreFile … StoreFile … … …
HFile HFile HFile HFile HFile HFile
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 195
HBase Architecture (2)
• Store: a Region consists of one or multiple Stores. Each Store corresponds to one Column Family.
• MemStore: a Store contains one MemStore. Data inserted into a Region by a client is first cached in the MemStore.
• StoreFile: data flushed to HDFS is stored as a StoreFile in HDFS.
• HFile: HFile defines the storage format of StoreFiles in the file system and is the underlying implementation of StoreFile.
• HLog: HLogs prevent data loss when a RegionServer is faulty. Multiple Regions in a RegionServer share the same HLog.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 196
HMaster (1)
"Hey, Region A, please move to RegionServer 1!"
"RegionServer 2 was gone! Let others take over
it's Regions!"
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 197
HMaster (2)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 198
RegionServer
Region
• RegionServer is the data service process of HBase and is responsible for
processing reading and writing requests of user data.
RegionServer
Region
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 199
Region (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 200
Region (2)
A table's rows are divided into Regions by RowKey range; each Region is defined by a StartKey and an EndKey:
• Region-1: Row001–Row010
• Region-2: Row011–Row020
• Region-3: Row021–Row030
• Region-4: Row031–…
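Locating the Region for a given RowKey from such [StartKey, EndKey) ranges is a sorted-range lookup, the same lookup a client performs against the meta table. The Region boundaries below mirror the example above; the code is a sketch, not HBase internals.

```python
# Locating a Region for a RowKey from sorted [StartKey, EndKey) ranges.
import bisect

# Regions sorted by StartKey; each EndKey is the next Region's StartKey.
region_starts = ["Row001", "Row011", "Row021", "Row031"]
region_names = ["Region-1", "Region-2", "Region-3", "Region-4"]

def locate_region(rowkey: str) -> str:
    """Return the Region whose [StartKey, EndKey) range contains rowkey."""
    idx = bisect.bisect_right(region_starts, rowkey) - 1
    return region_names[max(idx, 0)]

# "Row015" falls between Row011 and Row021, so it belongs to Region-2.
```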
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 201
Region (3)
META Region
Search for the address of the User Regions in the Meta Region.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 202
Column Family
/region-2/ColumnFamily-1
/region-2/ColumnFamily-2
/HBase/table
/region-1
/region-3/ColumnFamily-1
/region-2
/region-3/ColumnFamily-2
/region-3
HDFS
• A ColumnFamily is a physical storage unit of a Region. Multiple column families of the same Region have different paths in HDFS.
• ColumnFamily information is table-level configuration information. That is, multiple Regions of the same table have the same column family information. (For
example, each Region has two column families and the configuration information of the same column family of different Regions is the same) .
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 203
ZooKeeper
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 204
MetaData Table
• The metadata table hbase:meta stores information about Regions so that the client can locate the specific Region.
• The metadata table can be divided into multiple Regions, and the metadata information of the meta Region is stored in ZooKeeper.
• (Diagram: the mapping relation between the metadata table and the Regions of each user table, User Table 1 … User Table N.)
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 206
Writing Process
RegionServer (on the first floor)
Region 3 Palaeontology
Region 3 Region 4
Region 4 Physiochemistry
Region 5
Region 5 Biophysics
Book classification
storage
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 207
Client Initiating a Data Writing Request
Client
• The process of initiating a writing request by a client is like sending books to a library
by a book supplier. The book supplier must determine to which building and floor the
books should be sent.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 208
Writing Process - Locating a Region
1 Hi, META. Please send the bookshelf number, book number scope
(Rowkeys included in each Region), and information about the
floors where the bookshelves are located (RegionServers to which
the Regions belong) to me.
Bookshelf number
Region Book number
Rowkey
scope
Rowkey 070 Rowkey 071 Rowkey 072
Palaeontology
Region 3
Rowkey 075 Rowkey 006 Rowkey 078
Floor information
Regionserver
HClient
META
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 209
Writing Process - Grouping Data (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 210
Writing Process - Grouping Data (2)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 211
Writing Process - Sending a Request to a RegionServer
• After sending a data writing request, a client waits for the request processing result.
RegionServer • If the client does not capture any exception, it deems that all data has been written successfully. If
writing the data fails completely or partially, the client can obtain a detailed KeyValue list relevant to
the failure.
RegionServer
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 212
Writing Process - Process of Writing Data to a Region
(Diagram: RegionServers RS1, RS2, and RS5 each host Regions covering RowKey ranges such as Q1~Q3, Q4~Q5, Q6~Q7, Q8~Q10, and Q11~Q12. RS5: "I have stored the books in sequence according to the book number information provided by HClient.")
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 213
Writing Process - Flush
Each Column Family in a Region has its own MemStore, and each MemStore is flushed to its own HFile:
• ColumnFamily-1: MemStore-1 → HFile
• ColumnFamily-2: MemStore-2 → HFile
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 214
Impacts of Multiple HFiles
(Chart: read latency in ms versus running time in seconds, rising as HFiles accumulate.)
As time passes by, the number of HFiles increases and a query request will take much more
time.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 215
Compaction (1)
Compaction aims to reduce the number of small files in a column family in a Region, thereby improving read performance.
• Minor: a compaction covering a small range. Minimum and maximum numbers of files are specified, and small files from a consecutive time range are combined.
• Major: a compaction covering all HFiles in a column family in a Region. During a major compaction, deleted data is physically cleared.
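The semantics of a major compaction can be sketched as a merge of sorted files in which delete markers are finally dropped. The data model below (a `None` value standing in for a delete marker) is a simplification invented for illustration, not HBase's internal format.

```python
# Sketch of major-compaction semantics: merge sorted HFiles, let newer
# cells win, and physically drop deleted cells (illustrative model only).
import heapq

# Each "HFile" is a sorted list of (rowkey, value); None marks a deletion.
hfile1 = [("r1", "a"), ("r3", "c")]
hfile2 = [("r2", "b"), ("r3", None)]   # r3 was deleted later

def major_compact(*hfiles):
    merged = {}
    # heapq.merge keeps rowkeys sorted; ties preserve file order, so a
    # later file's cell overwrites an earlier one in the dict.
    for key, value in heapq.merge(*hfiles, key=lambda kv: kv[0]):
        merged[key] = value
    # Major compaction physically removes deleted cells.
    return [(k, v) for k, v in merged.items() if v is not None]

# The two files collapse into one, with the deleted row r3 gone.
```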
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 216
Compaction (2)
Write
put MemStore
Flush
Minor Compaction
Major Compaction
HFile
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 217
Region Split
Parent Region
• A common Region splitting operation is performed to split a Region into
two subregions if the data size of the Region exceeds the predefined
threshold.
• During splitting, the Region being split suspends its read and write services. The data files of the parent Region are not immediately split and rewritten into the two subregions; instead, reference files are created in the new Regions to achieve fast splitting. Therefore, services of the Region are suspended only briefly.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 218
Reading Process
RegionServer (on the first floor)
Region 3 Palaeontology
Region 3 Region 4
Region 4 Physiochemistry
Region 5
Region 5 Biophysics
Book classification
storage
Rowkey 001 Rowkey 002 Rowkey 003
Rowkey 006 Rowkey 007 Rowkey 009
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 219
Client Initiating a Data Reading Request
Get
Scan
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 220
Locating a Region
1
Hi, META, I want to look for books whose codes range from xxx to xxx. Please find the bookshelf numbers and the floor information within this code range.
Bookshelf number
Region Book number
Rowkey
scope
Rowkey 070 Rowkey 071 Rowkey 072
Palaeontology
Region 3
Rowkey 075 Rowkey 006 Rowkey 078
Floor information
Regionserver
HClient
META
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 221
OpenScanner
ColumnFamily-1
MemStore
• HFile-11
• HFile-12
Region
ColumnFamily-2
MemStore
• HFile-21
• HFile-22
During the OpenScanner process, a scanner is created for the MemStore and for each HFile:
• The scanner corresponding to an HFile is a StoreFileScanner.
• The scanner corresponding to the MemStore is a MemStoreScanner.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 222
Filter
• SingleColumnValueFilter
• KeyOnlyFilter
• FilterList
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 223
BloomFilter
BloomFilter is used to optimize scenarios where data is randomly read, that is,
scenarios where the Get operation is performed. It can be used to quickly check
whether a piece of user data exists in a large dataset (most data in the dataset
cannot be loaded to the memory).
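The property that makes a Bloom filter useful for Get requests is asymmetric certainty: a negative answer is definite, a positive answer is only probable. A minimal sketch, not HBase's implementation:

```python
# Minimal Bloom filter sketch: no false negatives, tunable false positives
# (illustrative; not HBase's BloomFilter implementation).
import zlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit array packed into one int

    def _positions(self, key: str):
        # Derive k independent positions by salting one hash function.
        for seed in range(self.num_hashes):
            yield zlib.crc32(f"{seed}:{key}".encode()) % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-0001")
# A Get can now skip any HFile whose filter answers False for the rowkey.
```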
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 224
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 225
Supporting Secondary Index
• The secondary index enables HBase to query data based on specific column values.
04 …… Wuhan 28 6812645 ……
05 …… Changsha 26 6889763 ……
06 …… Jinan 35 6854912 ……
• When the secondary index is not used, the mobile field must be matched row by row across the entire table to find specified mobile numbers such as "68XXX", which results in a long time delay.
• When the secondary index is used, the index table is searched first to identify the location of the mobile number,
which narrows down the search scope and reduces the time delay.
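A toy Python comparison, using hypothetical data mirroring the rows above, shows why the index table helps:

```python
# Toy comparison (hypothetical data): full-table scan vs. index-table lookup.
table = {
    "04": {"city": "Wuhan", "age": 28, "mobile": "6812645"},
    "05": {"city": "Changsha", "age": 26, "mobile": "6889763"},
    "06": {"city": "Jinan", "age": 35, "mobile": "6854912"},
}

# Without a secondary index: match the mobile column row by row (O(n)).
def scan_for_mobile(prefix):
    return [rk for rk, row in table.items() if row["mobile"].startswith(prefix)]

# With a secondary index: an index table maps mobile -> rowkey (O(1) lookup).
index = {row["mobile"]: rk for rk, row in table.items()}

print(scan_for_mobile("68"))   # ['04', '05', '06']
print(index.get("6889763"))    # '05'
```

The index table answers the query with one lookup instead of touching every row.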
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 226
HFS
• HBase FileStream (HFS) is a separate module of HBase. As an encapsulation of HBase and HDFS interfaces, HFS
provides capabilities, such as storing, reading and deleting files for upper-level applications.
• HFS provides the ability to store massive small files and large files in HDFS. That is, massive small files (less than 10 MB) and some large files (larger than 10 MB) can be stored in HBase.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 227
HBase MOB (1)
MOB data (100 KB to 10 MB) is stored directly in the file system (for example, HDFS) as HFiles, and the address and size of each file are stored in HBase as a value. With tools managing these files, the frequency of compaction and split can be greatly reduced, and performance can be improved.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 228
HBase MOB (2)
(Figure: each HRegionServer has an HLog and multiple Stores, some of which are MOB Stores; all data resides on HDFS.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 229
Summary
This module describes the following information about HBase: KeyValue Storage Model, technical architecture,
reading and writing process and enhanced features of FusionInsight HBase.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 230
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 231
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 232
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 233
More Information
• eLearning course:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 234
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 237
Objectives
Upon completion of this course, you will be able to know:
A. Hive application scenarios and basic principles
B. Enhanced features of FusionInsight Hive
C. Common Hive SQL statements
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 238
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 239
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 240
Hive Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 241
Hive Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 242
Application Scenarios of Hive
Data mining
Data aggregation
• Interest analysis
• Daily / Weekly click count
• User behavior analysis
• Traffic statistics
• Partition demonstration
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 243
Position of Hive in FusionInsight
Application service layer
System
Hadoop API Plugin API management
Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 244
Comparison Between Hive and Traditional Data
Warehouses (1)
• Execution engine: Hive uses MapReduce / Tez / Spark; a traditional warehouse can use a more efficient algorithm to query data and take more optimization measures to improve efficiency.
• Usage: Hive uses HQL (similar to SQL); a traditional warehouse uses SQL.
• Flexibility: in Hive, metadata and data are stored separately for decoupling; in a traditional warehouse, flexibility is low and data is used for limited purposes.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 245
Comparison Between Hive and Traditional Data
Warehouses (2)
• Index: Hive indexing has low efficiency and has not yet met expectations; traditional data warehouses index efficiently.
• Usability: for Hive, an application model must be developed, resulting in high flexibility but low usability; traditional data warehouses provide a set of well-developed report solutions to facilitate data analysis.
• Reliability: Hive data is stored in HDFS, which features high reliability and high fault tolerance; traditional data warehouses have relatively low reliability. When a query attempt fails, the query must be restarted, and data fault tolerance relies on hardware RAID.
• Environment dependence: Hive can be deployed on common computers; traditional data warehouses require high-performance commercial servers.
• Price: Hive is an open-source product; data warehouses used for commercial purposes are expensive.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 246
Advantages of Hive
Advantages of Hive
1 2 3 4
High reliability
and tolerance SQL-like query Scalability Multiple interfaces
• HiveServer in • SQL-like syntax • User defined storage • Beeline
cluster mode • Built-in functions in format • JDBC
• Dual-MetaStore large quantity • User defined • Thrift
• Query retry after functions (UDF / • Python
timeout UDAF / UDTF) • ODBC
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 247
Disadvantages of Hive
Disadvantages of Hive
1 2 3 4
Not support Inapplicable to Not support
High latency materialized OLTP storage process
• MapReduce • Does not support • Does not support • Does not support
execution engine materialized views. column-level data storage process,
by default. • Data updating, adding, updating, but supports logic
• High latency of insertion, and deletion and deletion. processing using
MapReduce. cannot be performed UDF.
on views.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 248
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 249
Hive Architecture
Hive
JDBC ODBC
Web
Command Line Interface Thrift Server
Interface
Driver
MetaStore
(Compiler, Optimizer, Executor)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 250
Hive Architecture in FusionInsight HD
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 251
Architecture of WebHCat
WebHCat provides REST interfaces for users to perform the following operations over the secure HTTPS protocol:
• Hive DDL operations
• Running Hive HQL tasks
• Running MapReduce tasks
(Figure: WebHCat Server(s), also known as Templeton Server(s), receive requests and dispatch them: DDL operations go to the HCat Server(s), while Pig, Hive, and MapReduce jobs are queued and run on HDFS, returning a Job_ID.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 252
Data Storage Model of Hive
Database
Table Table
Partition
Bucket
Bucket
Skewed Normal
Partition
Bucket
Bucket
data data
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 253
Data Storage Model of Hive - Partition and
Bucket
Partition: A data table can be divided into partitions
by using a field value.
• Each partition is a directory.
• The number of partitions is configurable.
• A partition can be partitioned or bucketed.
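Because each partition is a directory, a query with a partition predicate only reads matching directories. A minimal Python sketch of this pruning; the table name and directory layout below are illustrative assumptions:

```python
# Partition pruning sketch (assumed layout): each Hive partition is an HDFS
# directory named <partition_column>=<value> under the table directory.
partitions = {
    "dt=2019-01-01": ["/user/hive/warehouse/logs/dt=2019-01-01/part-0000"],
    "dt=2019-01-02": ["/user/hive/warehouse/logs/dt=2019-01-02/part-0000"],
    "dt=2019-01-03": ["/user/hive/warehouse/logs/dt=2019-01-03/part-0000"],
}

def prune(predicate_value):
    """Only directories matching the partition predicate are read."""
    key = f"dt={predicate_value}"
    return partitions.get(key, [])

print(prune("2019-01-02"))  # only one directory is scanned
```

A query filtering on dt therefore touches one directory instead of the whole table.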
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 254
Data Storage Model of Hive - Managed Table and
External Table
Hive can create managed tables and external tables:
• Managed tables are created by default and managed by Hive. In this case, Hive migrates data into data warehouse directories.
• When external tables are created, Hive accesses data from locations outside the data warehouse directories.
• Use managed tables when Hive performs all operations.
• Use external tables when Hive and other tools share the same data set for different processing.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 256
Functions of Hive
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 258
Enhanced Features of Hive - Colocation
Overview
• Colocation: storing associated data or data on which associated operations are performed on
the same storage node.
• File-level Colocation allows quick file access. This avoids network consumption caused by
data migration.
(Figure: NameNode NN #1 with DataNodes DN #1 to DN #6; replicas of blocks A, B, C, and D of associated files are placed on the same DataNodes.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 259
Enhanced Features of Hive - Using Colocation
CREATE TABLE tbl_2 (id INT, name STRING) row format delimited
fields terminated by '\t' stored as TEXTFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 260
Enhanced Features of Hive - Encrypting
Columns
Step 1: When creating a table, specify the columns to be
encrypted and the encryption algorithm.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 261
Enhanced Features of Hive - Deleting HBase
Records in Batches
Overview:
• In FusionInsight HD, Hive allows deletion of a single record from an HBase table. Hive can use specific syntax to delete one or more data
records that meet criteria from its HBase tables.
Usage:
• To delete some data from an HBase table, run the following HQL statement:
Here, expression indicates the criteria for selecting the data to be deleted.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 262
Enhanced Features of Hive - Controlling Traffic
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 263
Enhanced Features of Hive -
Specifying Row Delimiters
Step 1: Set inputFormat and outputFormat when creating a table.
set hive.textinput.record.delimiter='!@!';
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 264
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 265
Hive SQL Overview
DML - Data Manipulation Language
• Data import and export.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 266
Hive Basic Operations (1)
Data format example: 1,huawei,1000.0
• Create managed table.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 267
Hive Basic Operations (2)
• Modify a column.
ALTER TABLE employee1 CHANGE money string COMMENT 'changed
by alter' AFTER dateincompany;
• Add a column.
ALTER TABLE employee1 ADD columns(column1 string);
• Describe table.
DESC table_a;
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 268
Hive Basic Operations (3)
• Load data from the local file system.
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE
example.employee;
• Insert data.
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century= '23' AND year='2010';
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 269
Hive Basic Operations (4)
• WHERE.
SELECT id, name FROM employee WHERE salary >= 10000;
• GROUP BY.
SELECT department, avg(salary) FROM employee GROUP BY department;
• UNION ALL.
SELECT id, salary, date FROM employee_a UNION ALL
SELECT id, salary, date FROM employee_b;
• JOIN.
SELECT a.salary, b.address FROM employee a JOIN employee_info
b ON a.name=b.name;
• Subquery.
SELECT a.salary, b.address FROM employee a JOIN (SELECT
address FROM employee_info where province='zhejiang') b ON
a.name=b.name;
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 270
Summary
This module describes the following information about Hive: basic principles, application scenarios, enhanced
features in FusionInsight and common Hive SQL statements.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 271
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 272
Quiz
• Which of the following statements about Hive SQL operations are correct?
A. The keyword external is used to create an external table and the keyword internal is used to create a common table.
B. Specify the location information when creating an external table.
C. When data is uploaded to Hive, the data source must be one HDFS path.
D. When creating a table, column delimiters can be specified.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 273
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 274
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 278
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 279
Streaming Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 280
Application Scenarios of Streaming
Real-time log processing and Real-time website access statistics Real-time advertisement
vehicle traffic analysis and sorting positioning and event
marketing
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 281
Position of Streaming in FusionInsight
Application service layer
System
Hadoop API Plugin API management
Service
Hive M/R Spark Streaming Flink governance
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 282
Comparison with Spark Streaming
(Figure: Spark Streaming splits the input stream into batches t1 … tn, generates RDDs r1, r2, …, and starts Spark batch jobs; the Task Scheduler and Memory Manager execute the RDD transformations and output batches of results.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 283
Comparison of Application Scenarios
(Figure: real-time performance over time. Streaming achieves millisecond-level latency; Spark Streaming achieves second-level latency.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 284
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 285
Basic Concepts (1)
Topology
A real-time application in Streaming.
Nimbus
Assigns resources and schedules tasks.
Supervisor
Receives tasks assigned by Nimbus, and starts/stops Worker processes.
Worker
Runs component logic processes.
Spout
Generates source data flows in a topology.
Bolt
Receives and processes data in a topology.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 286
Basic Concepts (2)
Task
A Spout or Bolt thread in a Worker.
Tuple
The core data structure of Streaming. It is the basic message delivery unit, organized as key-value pairs, which can be created and processed in a distributed way.
Stream
An infinite continuous Tuple sequence.
ZooKeeper
Provides distributed collaboration services for processes. Active / Standby Nimbus,
Supervisor, and Worker register their information in ZooKeeper. This enables Nimbus to
detect the health status of all roles.
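The Spout, Bolt, and Tuple concepts above can be sketched as a tiny word-count pipeline in plain Python (an illustration only, not the Streaming API):

```python
# Minimal spout/bolt pipeline sketch (illustrative, not the real API).
class WordSpout:
    """Generates the source data flow (tuples) of the topology."""
    def __init__(self, lines):
        self.lines = lines
    def next_tuples(self):
        for line in self.lines:
            yield {"line": line}

class SplitBolt:
    """Receives tuples and emits one tuple per word."""
    def execute(self, tup):
        for word in tup["line"].split():
            yield {"word": word}

class CountBolt:
    """Aggregates word counts."""
    def __init__(self):
        self.counts = {}
    def execute(self, tup):
        w = tup["word"]
        self.counts[w] = self.counts.get(w, 0) + 1

spout, split_bolt, count_bolt = WordSpout(["big data", "big deal"]), SplitBolt(), CountBolt()
for t in spout.next_tuples():
    for t2 in split_bolt.execute(t):
        count_bolt.execute(t2)
print(count_bolt.counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

In a real topology, each class would run as concurrent tasks and tuples would flow between them over the network.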
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 288
System Architecture
(Figure: Worker processes run Executors and report heartbeats.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 289
Topology
• A topology is a directed acyclic graph (DAG) consisting of Spout (data source) and Bolt (for logical
processing). Spout and Bolt are connected through Stream Groupings.
• Service processing logic is encapsulated in topologies in Streaming.
Filters data
Obtains stream data
from external data
sources Bolt A Bolt B
Persistent
archiving
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 290
Worker
A Worker is a JVM process, and a topology runs in one or more Workers. A started Worker runs until it is manually stopped. The number of Worker processes depends on the topology setting and has no upper limit; the number of Worker processes that can be scheduled and started depends on the number of slots configured in the Supervisor.

Executor
A Worker process runs one or more Executor threads. Each Executor can run one or more task instances of either Spout or Bolt.

(Figure: a Worker Process contains Executor threads, each running one or more Tasks.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 291
Task
Both Spout and Bolt in a topology support concurrent running. In the topology, you can specify the number of concurrently running tasks on each node. Streaming assigns tasks across the cluster to enable simultaneous computation and enhance the processing capability of the system.
(Figure: a Spout distributes tuples via Stream Grouping to concurrently running Bolt A, Bolt B, and Bolt C tasks.)
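One common way to route tuples among concurrent tasks is fields grouping, where tuples carrying the same field value always reach the same task. A minimal Python sketch; the task count and hash choice are assumptions for illustration:

```python
import zlib

NUM_TASKS = 3  # assumed number of concurrent Bolt tasks

def fields_grouping(key):
    # A stable hash keeps the key -> task mapping deterministic across runs.
    return zlib.crc32(key.encode()) % NUM_TASKS

# The same field value is always routed to the same Bolt task instance.
assert fields_grouping("user-42") == fields_grouping("user-42")
tasks = {fields_grouping(k) for k in ("a", "b", "c", "d", "e")}
print(sorted(tasks))  # some subset of {0, 1, 2}
```

This determinism is what makes per-key aggregation (for example, per-user counters) correct under concurrency.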
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 292
Message Delivery Policies
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 293
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 295
Nimbus HA
ZooKeeper cluster
Streaming cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 296
Disaster Recovery
Services are automatically migrated from faulty nodes to normal ones, preventing service interruptions.
(Figure: Topo1 runs on Node1 to Node3 and migrates between them automatically, with zero manual operation.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 297
Message Reliability
• At Most Once: no processing mechanism. This mode has the highest throughput and applies to messages with low reliability requirements.
• At Least Once: Ack mechanism. This mode has low throughput and applies to messages with high reliability requirements; all data must be completely processed.
• Exactly Once: Trident, a special transactional API provided by Storm; it has the lowest throughput.

When a tuple is completely processed in Streaming, the tuple and all its derived tuples have been successfully processed. A tuple fails to be processed if the processing is not complete within the timeout period.
(Figure: a tuple tree. Root tuple A derives tuples B and C, which in turn derive D and E; the root is complete only after every derived tuple is processed.)
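Storm's Ack mechanism tracks each tuple tree with a single 64-bit value: every tuple ID is XORed into it once on emit and once on ack, so the value returns to 0 exactly when the whole tree has been processed. A minimal sketch:

```python
import random

# Sketch of the ack mechanism: every tuple id is XORed into a per-tree
# "ack val" once when emitted and once when acked; the tree is fully
# processed exactly when the value returns to 0.
ack_val = 0
tuple_ids = [random.getrandbits(64) for _ in range(5)]  # root + derived tuples

for tid in tuple_ids:      # emitted
    ack_val ^= tid
for tid in tuple_ids:      # acked (any order)
    ack_val ^= tid

print(ack_val == 0)  # True: the whole tuple tree was processed
```

XOR makes the check order-independent and constant-space, which is why one acker task can track millions of in-flight tuples.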
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 298
Ack Mechanism
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 299
Reliability Level Setting
If not every message is required to be processed (allowing some message loss), the
reliability mechanism can be disabled to ensure better performance.
• Set Config.TOPOLOGY_ACKERS to 0.
• Use Spout to send messages through interfaces that do not specify message IDs.
• Use Bolt to send messages in Unanchor mode.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 300
Streaming and Other Components
Kafka
Topic N Topology N
……
External components such as HDFS and HBase are integrated to facilitate real-time
offline analysis.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 301
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 302
StreamCQL Overview
StreamCQL (Stream Continuous Query Language) is a query language for distributed stream processing platforms and can be built on various stream processing engines (mainly Apache Storm).
Currently, most stream processing platforms provide only distributed processing capabilities but involve complex service logic development and poor stream computing capabilities. The development efficiency is low due to low reuse and repeated development. StreamCQL provides various distributed stream computing functions, including traditional SQL functions such as filtering and conversion, and new functions such as stream-based time window computing, window data statistics, and stream data splitting and merging.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 303
StreamCQL Easy to Develop
Native Storm API (Java):

//Def Input:
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {…}
public void nextTuple() {…}
public void ack(Object id) {…}
public void declareOutputFields(OutputFieldsDeclarer declarer) {…}
//Def logic:
public void execute(Tuple tuple, BasicOutputCollector collector) {…}
public void declareOutputFields(OutputFieldsDeclarer ofd) {…}
//Def Output:
public void execute(Tuple tuple, BasicOutputCollector collector) {…}
public void declareOutputFields(OutputFieldsDeclarer ofd) {…}
//Def Topology:
public static void main(String[] args) throws Exception {…}

StreamCQL:

--Def Input:
CREATE INPUT STREAM S1 …
--Def logic:
INSERT INTO STREAM filterstr SELECT * FROM S1 WHERE name="HUAWEI";
--Def Output:
CREATE OUTPUT STREAM S2…
--Def Topology:
SUBMIT APPLICATION test;

Functions: Join, Aggregate, Split, Merge, Pattern Matching, Stream, Window
Engine: Storm and other stream processing engines
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 305
Summary
This module describes the following information about Streaming:
• Definition
• Application Scenarios
• Position of Streaming in FusionInsight
• System architecture of Streaming
• Key features of Streaming
• Introduction to StreamCQL
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 306
Quiz
How is message reliability guaranteed in
Streaming?
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 307
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 308
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 309
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 310
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 314
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 315
Flink Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 316
Key Features of Flink
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 317
Key Features of Flink
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 318
Application Scenarios of Flink
Typical scenarios:
• Internet finance services.
• Clickstream log processing.
• Public opinion monitoring.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 319
Hadoop Compatibility
Flink supports YARN and can obtain data from the Hadoop distributed file system (HDFS) and HBase.
Flink supports the Mappers and Reducers of Hadoop, which can be used together with Flink operations.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 320
Performance Comparison of Stream Computing
Frameworks
Storm & Flink Identity Single-Thread Throughput
(Bar chart: throughput in pieces/s for a 1 partition source and an 8 partition source; the measured values are 350,466.22 and 277,792.60 versus 87,729.76 and 76,519.48.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 321
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 322
Flink Technology Stack
(Figure: Flink technology stack. APIs & Libraries: FlinkML for machine learning, Gelly for graph processing, CEP for event processing, and Table for relational queries; Core: the runtime, a distributed streaming dataflow engine; plus a deploy layer.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 323
Core Concept of Flink - DataStream
DataStream: Flink uses the DataStream type to represent streaming data in applications. A DataStream can be considered an immutable collection that may contain duplicate elements; the number of DataStream elements is unbounded.
(Figure: DataStream operations. windowAll() produces an AllWindowedStream; keyBy() produces a KeyedStream; window() produces a WindowedStream; coGroup(DataStream) produces CoGroupedStreams; window(…).apply(…) and aggregations such as reduce(), fold(), sum(), and max() return DataStreams.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 324
DataStream
Data source: indicates the streaming data source, which can be HDFS files, Kafka
data, or texts.
Data sink: indicates data output, which can be HDFS files, Kafka data, or texts.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 325
Data Source of Flink
Files:
• HDFS, local file system, and MapR file system
• Text, CSV, Avro, and Hadoop input formats
Other sources: JDBC, HBase, and Collections
Streams: Socket streams, Kafka, RabbitMQ, and Flume
Implement your own: SourceFunction.collect
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 326
DataStream Transformations
Common
transformations
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 327
DataStream Transformations
(Figure: an example pipeline. textFile reads from HDFS; flatMap, map, keyBy, and Window / Join transformations are applied; writeAsText writes the result back to HDFS.)
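A pipeline of this shape can be simulated with plain Python generators (a sketch only, not the Flink DataStream API):

```python
# Word-count sketch of a flatMap -> map -> keyBy -> sum pipeline,
# simulated with plain Python instead of the Flink DataStream API.
lines = ["to be or not", "to be"]

flat = (w for line in lines for w in line.split())   # flatMap: line -> words
pairs = ((w, 1) for w in flat)                       # map: word -> (word, 1)
counts = {}
for word, n in pairs:                                # keyBy(word) + sum
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Flink, each stage would run as parallel operator subtasks, with keyBy redistributing records by key between them.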
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 328
Flink Application Running Process - Key Roles
• Indicates the request initiator, which submits application requests and creates the
Client data flow.
ResourceManager • Indicates the resource management department, which schedules and allocates the
of YARN resources of the entire cluster in a unified manner.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 329
Flink Job Running Process
(Figure: TaskManager (Worker) processes run Tasks in Task Slots; a Checkpoint Coordinator coordinates state snapshots.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 330
Flink on YARN
(Figure: Flink on YARN. Step 2: register resources and request the AppMaster container from the YARN ResourceManager; step 3: the ResourceManager allocates the AppMaster container.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 331
Technical Principles of Flink (1)
Stream
Transformation
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 332
Technical Principles of Flink (2)
The source operator is used to load streaming data. Transformation operators, such as map(), keyBy(), and apply(), are used to process streaming data. After streaming data is processed, the sink writes the processed streaming data into related storage systems, such as HDFS, HBase, and Kafka.
(Figure: a Streaming Dataflow: Source → map() → keyBy()/apply() → Sink.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 333
Parallel DataStream of Flink
(Figure: the condensed view Source → map() → keyBy()/apply() → Sink expands into parallel operator subtasks: Source/map() subtasks [1] and [2] and keyBy()/apply() subtasks [1] and [2] run with parallelism = 2, while the Sink subtask [1] runs with parallelism = 1; each subtask processes its own stream partition.)
Operator Chain of Flink
Streaming Dataflow (condensed view)
keyBy()
Source map() apply()
[1] [1] [1]
Sink
Subtask (=thread) Subtask (=thread) [1]
keyBy()
Source map()
apply()
[2] [2]
[2]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 335
Windows of Flink
Flink supports operations based on time windows and operations based on data (count) windows.
• Categorized by splitting standard: time windows and count windows.
• Categorized by window action: tumbling windows, sliding windows, and custom windows.
(Figure: event-time windows along a stream of events.)
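Window assignment can be sketched in Python; the window size and slide below are assumed example values (in seconds), not Flink defaults:

```python
# Which window(s) does an event with timestamp ts fall into?
def tumbling_window(ts, size):
    # Tumbling windows partition time into non-overlapping intervals.
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    # Sliding windows overlap: an event belongs to every window covering ts.
    windows = []
    start = (ts // slide) * slide
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return [w for w in windows if w[0] <= ts < w[1]]

print(tumbling_window(7, 5))      # [(5, 10)]
print(sliding_windows(7, 10, 5))  # [(5, 15), (0, 10)]
```

A tumbling window assigns each event to exactly one interval, while a sliding window of size 10 and slide 5 assigns it to two overlapping intervals.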
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 336
Common Window Types of Flink (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 337
Common Window Types of Flink (2)
(Figure: sliding windows. Windows 1 to 4 overlap along the time axis for user1, user2, and user3; each window has the same window size.)
Common Window Types of Flink (3)
Session windows are considered complete if no data arrives within the preset gap period.
(Figure: per-user session windows, e.g. windows 1 to 4 for user1 and windows 1 to 3 for user2, separated by the session gap along the time axis.)
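A session-window assigner can be sketched in Python; it assumes the timestamps arrive sorted and uses an example gap value:

```python
# Group sorted event timestamps into session windows: events closer than
# session_gap belong to one window; a larger gap closes the window.
def session_windows(timestamps, session_gap):
    windows, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] <= session_gap:
            current.append(ts)
        else:
            windows.append((current[0], current[-1]))
            current = [ts]
    windows.append((current[0], current[-1]))
    return windows

events = [1, 2, 3, 10, 11, 30]
print(session_windows(events, 5))  # [(1, 3), (10, 11), (30, 30)]
```

Unlike tumbling and sliding windows, session window boundaries are driven by the data itself rather than by fixed intervals.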
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 339
Fault Tolerance of Flink
The checkpoint mechanism is a key fault tolerance measure of Flink.
The checkpoint mechanism keeps creating status snapshots of stream applications. The status
snapshots of the stream applications are stored at a configurable place (for example, in the memory
of JobManager or on HDFS).
The core of the distributed snapshot mechanism of Flink is the barrier. Barriers are periodically inserted
into data streams and flow as part of the data streams.
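The barrier idea can be sketched in Python: a barrier travels with the records, and an operator snapshots its state the moment the barrier arrives (a deliberately simplified single-operator sketch):

```python
# A barrier flows with the records; when the operator sees it, the operator
# snapshots the state built from all records BEFORE the barrier.
stream = [1, 2, "BARRIER", 3, 4]

state, snapshots = 0, []
for item in stream:
    if item == "BARRIER":
        snapshots.append(state)   # checkpoint the state built so far
    else:
        state += item

print(snapshots)  # [3]  -> state after records 1 and 2 only
print(state)      # 10
```

On failure, Flink can restore the snapshot and replay only the records that came after the barrier.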
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 340
Checkpoint Mechanism (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 341
Checkpoint Mechanism (2)
(Figure: the CheckpointCoordinator injects a barrier at the source operator. As the barrier flows through the intermediate operator to the sink operator, each operator takes a snapshot of its state and reports it to the CheckpointCoordinator.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 342
Checkpoint Mechanism (3)
(Figure: barrier alignment. Operator C receives a barrier from source A and a barrier from source B; once both have arrived, C takes a snapshot and emits a merged barrier downstream to operator D.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 343
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture of FusionInsight HD
Flink
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 344
Location of Flink in FusionInsight Products
Application service layer
System
Hadoop API Plugin API management
Service
Hive MapReduce Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
• FusionInsight HD provides a big data processing environment, adopting industry best practices for each scenario and enhancing the open-source software.
• Flink is a unified computing framework that supports both batch processing and stream processing. Flink
provides high-concurrency pipeline data processing, millisecond-level latency, and high reliability.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 345
Flink Web UI
The FusionInsight HD platform provides a visual management and monitoring UI for Flink. You
can use the YARN Web UI to query the running status of Flink tasks.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 346
Interaction of Flink with Other Components
HDFS
• (mandatory) Flink reads and writes data in HDFS.
YARN
• (mandatory) Flink relies on YARN to schedule and manage
resources for running tasks.
ZooKeeper
• (mandatory) Flink relies on ZooKeeper to implement the
checkpoint mechanism.
Kafka
• (optional) Flink can receive data streams sent from Kafka.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 347
Summary
• These slides describe the following information about Flink: basic concepts, application scenarios,
technical architecture, window types, and Flink on YARN.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 348
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 349
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 350
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 354
01 02
Introduction to Loader Loader Job Management
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 355
What Is Loader
• Loader is a tool for loading and exchanging data and files between FusionInsight HD and relational
databases or file systems. Loader provides a wizard-based job configuration and management
Web UI and supports scheduled tasks and periodic execution of Loader jobs. On the
Web UI, users can specify multiple data sources and configure data cleaning and conversion
steps as well as the cluster storage system.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 356
Application Scenarios of Loader
RDB
Hadoop
SFTP Server
Loader • HDFS
FTP Server
• HBase
• Hive
Customized
Data Source
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 357
Position of Loader in FusionInsight
System
Hadoop API Plugin API management
Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
Loader is a loading tool for data and file exchange between FusionInsight HD and
relational databases and file systems.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 358
Features of Loader
Loader
GUI
• Provides a GUI that facilitates operations.
Security
• Kerberos authentication.
High Reliability
• Deploys Loader Servers in active / standby mode.
• Uses MapReduce to execute jobs and supports
retry after failure. High Performance
• Leaves no junk data after a job failure occurs. • Uses MapReduce for parallel data processing.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 359
Module Architecture of Loader
Transform Engine
Job
Execution Engine
Scheduler
Submission Engine Yarn Map Task
Job Manager
HBase
Metadata Repository
HDFS Reduce Task
HA Manager
Hive
Loader Server
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 360
Module Architecture of Loader - Module Description
Module Description
Loader Client: provides a web user interface (Web UI) and a command-line interface (CLI).
Loader Server: processes operation requests sent from the client, manages connectors and metadata, submits MapReduce jobs, and monitors MapReduce job status.
REST API: provides a Representational State Transfer (RESTful) interface (HTTP + JSON) to process operation requests from the client.
Job Scheduler: periodically executes Loader jobs.
Transform Engine: a data transformation engine that supports field combination, string cutting, and string reversal.
Execution Engine: executes Loader jobs in MapReduce mode.
Submission Engine: submits Loader jobs to MapReduce.
Job Manager: manages Loader jobs, including creating, querying, updating, deleting, activating / deactivating, starting, and stopping jobs.
Metadata Repository: a metadata warehouse that stores and manages connectors, conversion steps, and Loader jobs.
HA Manager: manages the active / standby status of the Loader Servers; two Loader Servers are deployed in active / standby mode.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 361
01 02
Introduction to Loader Loader Job Management
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 362
Service Status Web UI of Loader
• Choose Services > Loader to go to the Loader Status page.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 363
Job Management Web UI of Loader
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 364
Job Management Web UI of Loader - Job
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 365
Job Management Web UI of Loader - Job
Conversion Rules
Loader Conversion Operators:
• Extracts Fields: separates an existing field by using specified delimiters to generate new fields.
• Modulo Integer: performs modulo operations on existing fields to generate new fields.
• String Cut: cuts existing string fields by the specified start position and end position to generate new fields.
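The three operators can be illustrated with hypothetical re-implementations (the function names and signatures below are ours for illustration, not Loader's actual API):

```python
# Illustrative (hypothetical) versions of three Loader conversion operators.

def extract_fields(value, delimiter):
    """Extract Fields: split one field by a delimiter into several new fields."""
    return value.split(delimiter)

def modulo_integer(value, modulus):
    """Modulo Integer: derive a new field from an existing integer field."""
    return value % modulus

def string_cut(value, start, end):
    """String Cut: keep the characters between the start and end positions."""
    return value[start:end]

print(extract_fields("2019-06-01", "-"))  # ['2019', '06', '01']
print(modulo_integer(1028, 10))           # 8
print(string_cut("FusionInsight", 0, 6))  # 'Fusion'
```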
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 366
Creating a Loader Job - Basic Information
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 367
Creating a Loader Job - From
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 368
Creating a Loader Job - Transform
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 369
Creating a Loader Job - To
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 370
Monitoring Job Execution Status
• The page displays all current jobs and last execution status.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 371
Monitoring Job Execution Status - Job Execution
History
• The historical record page displays the start time, duration (s), status, failure cause,
number of read / written / skipped rows / files, dirty data link, and MapReduce log
link of each execution.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 372
Monitoring Job Execution Status - Dirty Data
Dirty data refers to data that does not meet the Loader conversion rules. It can be
checked as follows:
• If the number of skipped job records is not 0 on the job history page, click the dirty data button to go to the
dirty data directory.
• Dirty data is stored in HDFS, and the dirty data generated by each Map job is stored in a separate file.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 373
Monitoring Job Execution Status - MapReduce Log
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 374
Monitoring Job Execution Status - Job Execution
Failure Alarm
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 375
Summary
This module describes the following information about Loader: main functions and features,
job management and monitoring.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 376
Quiz
• True or False:
A. FusionInsight Loader supports only data import and export between relational databases and
Hadoop HDFS and HBase.
B. Conversion steps must be configured for Loader jobs.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 377
Quiz
B. Dirty data refers to the data that does not comply with conversion rules.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 378
Quiz
B. If a Mapper execution fails after Loader submits a job to MapReduce, a second execution is automatically performed.
C. Residual data generated after a Loader job fails to be executed needs to be manually cleared.
D. After Loader submits a job to MapReduce for execution, it cannot submit other jobs before the execution is complete.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 379
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 380
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 383
Objectives
Upon completion of this course, you will be able to know:
A. What Flume is
B. Functions of Flume
C. Position of Flume in FusionInsight
D. System architecture of Flume
E. Key characteristics of Flume
F. Application examples of Flume
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 384
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 385
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 386
What Is Flume
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 387
Functions of Flume
Flume can collect logs from a specified directory and save the logs in a
01 specified path (HDFS, HBase, and Kafka).
02 Flume can collect and save logs (taildir) to a specified path in real time.
Flume supports the cascading mode (multiple Flume nodes interwork with
03 each other) and data aggregation.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 388
Position of Flume in FusionInsight
Application service layer
System
Hadoop API Plugin API management
Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 389
Architecture of Flume (1)
• Basic Flume architecture: Flume can directly collect data on a single node. This architecture is mainly applicable to
data collection within a cluster.
• Multi-agent Flume architecture: multiple Flume nodes can be connected. After collecting initial data from the data
sources, Flume saves the data in the final storage system. This architecture is mainly applicable to importing data
from outside the cluster into the cluster.
Source Source
Sink Sink
Log HDFS
Channel Channel
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 390
Architecture of Flume (2)
(Diagram: inside an agent, events flow from the Source through the Channel Processor, Interceptor, and Channel Selector into one or more Channels, and are then delivered to the Sinks by the Sink Runner and Sink Processor.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 391
Basic Concept - Source (1)
• Event-driven source: The external source actively sends data to Flume to drive Flume to
accept the data.
• Event polling source: Flume periodically obtains data in an active manner.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 392
Basic Concept - Source (2)
thrift source: the same as the avro source, except that the transmission protocol is Thrift.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 393
Basic Concept - Channel (1)
The channel is located between the source and the sink and functions like a queue: it temporarily
stores events, and when the sink successfully sends events to the next-hop channel or the final
destination, the events are removed from the current channel.
Channels support transactions, provide weak ordering guarantees, and can connect to any number
of sources and sinks.
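The transactional take/commit/rollback behavior of a channel can be sketched as follows (a toy model, not Flume's actual Channel interface): events taken by a sink are removed only after the transaction commits; on rollback they are returned to the channel, which is how buffered events survive a failed delivery.

```python
# Toy model of a transactional channel (not Flume's real API).
from collections import deque

class Channel:
    def __init__(self):
        self.queue = deque()
        self.in_flight = []

    def put(self, event):
        self.queue.append(event)

    def take(self, n):
        """Start a take transaction: reserve up to n events for the sink."""
        self.in_flight = [self.queue.popleft()
                          for _ in range(min(n, len(self.queue)))]
        return self.in_flight

    def commit(self):
        self.in_flight = []   # events delivered: drop them for good

    def rollback(self):
        # delivery failed: return events to the head of the queue, in order
        self.queue.extendleft(reversed(self.in_flight))
        self.in_flight = []

ch = Channel()
for e in ["e1", "e2", "e3"]:
    ch.put(e)
ch.take(2)               # sink takes e1 and e2
ch.rollback()            # delivery failed: events return to the channel
print(list(ch.queue))    # ['e1', 'e2', 'e3']
```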
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 394
Basic Concept - Channel (2)
Memory channel
Messages are saved in the memory. This channel supports high
throughput but no reliability. Data may be lost.
File channel
It supports data persistence. However, the configuration is
complex: both a data directory and a checkpoint directory
need to be configured, and different file channels must use
different checkpoint directories.
JDBC channel
It uses an embedded Derby database. It supports event persistence
and high reliability, and can replace the file channel where
persistence is required.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 395
Basic Concept - Sink (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 396
Basic Concept - Sink (2)
avro sink: transmits data to the next-hop Flume node using the Avro protocol.
thrift sink: the same as the avro sink, except that the transmission protocol is Thrift.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 397
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 398
Log Collection
• Flume can collect logs beyond a cluster and archive the logs in the HDFS, HBase,
and Kafka for data analysis and cleaning by upper-layer applications.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 399
Multi-level Cascading and Multi-channel Duplication
• Multiple Flume nodes can be cascaded. The cascaded nodes support internal
data duplication.
Source
Channel
Log
Sink
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 400
Message Compression and Encryption by
Cascaded Flume Nodes
• Data transmitted between cascaded Flume nodes can be compressed and encrypted,
thereby improving the data transmission efficiency and security.
(Diagram: data is compressed and encrypted before RPC transmission between Flume nodes, then decompressed and decrypted before being written to HDFS / Hive / HBase / Kafka through the Flume API.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 401
Data Monitoring
FusionInsight
Flume monitoring information
Manager
Flume
Application
Received data size Transmitted data size
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 402
Transmission Reliability
• Flume adopts the transaction management mode for data transmission. This mode ensures the
data security and enhances the reliability during transmission. In addition, if the file channel is used
to transmit data buffered in the channel, the data is not lost when a process or node is restarted.
Start tx
Take events
Send events
Start tx
Put events
End tx End tx
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 403
Transmission Reliability (Failover)
• During data transmission, if the next-hop Flume node is faulty or receives data abnormally, the
data is automatically switched over to another path.
Source Sink
HDFS
Source
Channel
Sink
Log
Channel
Source Sink
Sink
HDFS
Channel
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 404
Data Filtering During Transmission
• During data transmission, Flume can coarsely filter and clean the data, dropping unneeded records. If
you need to filter complex data, you can develop filter plug-ins based on the characteristics of that
data; Flume supports third-party filter plug-ins.
Interceptor
Channel
events events
Channel Channel
Source
Processor Selector events
Channel
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 405
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 406
Flume Example 1 (1)
Data
Description preparations
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 407
Flume Example 1 (2)
Download the
Flume Client
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 408
Flume Example 1 (3)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 409
Flume Example 1 (4)
• Configure flume source
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 410
Flume Example 1 (5)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 411
Flume Example 1 (6)
• Configure flume sink
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 412
Flume Example 1 (7)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 413
Flume Example 1 (8)
mv /var/log/log.11 /tmp/log_test
03 log.11 is renamed log.11.COMPLETED, which indicates that the data collection succeeded.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 414
Flume Example 2 (1)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 415
Flume Example 2 (2)
• Configure flume source:
server.sources = a1
server.channels = ch1
server.sinks = s1
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 416
Flume Example 2 (3)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 417
Flume Example 2 (4)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 418
Flume Example 2 (5)
• Upload the configuration file to Flume.
• Use Kafka commands to view the data collected in the Kafka topic topic_1028.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 419
Summary
This course describes Flume functions and application scenarios, including the basic concepts, functions,
reliability, and configuration items. Upon completion of this course, you can understand Flume functions,
application scenarios, and configuration methods.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 420
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 421
Quiz
True or False
• Flume supports cascading. That is, multiple Flume nodes can be cascaded for
data transmission.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 422
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 423
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 427
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 428
Kafka Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 429
Kafka Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 430
Kafka Overview
Application scenarios
• Compared with other components, Kafka features message persistence, high throughput, distributed processing, and
real-time processing. It applies to online and offline message consumption and massive data collection scenarios,
such as website activity tracking, operational monitoring with aggregated statistics, and log collection.
(Diagram: frontend and backend producers and Flume publish messages to Kafka; Hadoop, Storm, and Spark consume them.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 431
Position of Kafka in FusionInsight
System
Hadoop API Plugin API management
Service
M/R Hive Kafka Spark Streaming Solr
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
Kafka is a distributed messaging system that supports online and offline message
processing and provides Java APIs for other components.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 432
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
Kafka Write Process
Kafka Read Process
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 433
Kafka Topology
ZooKeeper
ZooKeeper
(Kafka) Broker Broker Broker ZooKeeper
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 434
Kafka Topics
Consumer group 1
Consumer group 2
A consumer uses offsets to record its read position.
Kafka cleans up old messages based on retention time and log size.
Kafka topic
… new
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 435
Kafka Partition
• Each topic contains one or more partitions. Each partition is an ordered and immutable
sequence of messages. Partitions ensure high throughput capabilities of Kafka.
Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12
Partition 1 0 1 2 3 4 5 6 7 8 9 Writes
Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12
Old New
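How a keyed message lands in a partition and receives an offset can be sketched as follows (a toy model: real Kafka uses a murmur2 hash and persistent on-disk logs, and the helper names here are ours):

```python
# Toy model of topic partitioning and per-partition offsets.

def hash_key(key):
    return sum(key.encode())             # toy stand-in for Kafka's murmur2 hash

def make_topic(num_partitions):
    return [[] for _ in range(num_partitions)]

def produce(topic, key, value):
    p = hash_key(key) % len(topic)       # same key -> same partition
    topic[p].append(value)               # append to the partition's ordered log
    return p, len(topic[p]) - 1          # (partition, offset) of the record

topic = make_topic(3)
p1, o1 = produce(topic, "user1", "click")
p2, o2 = produce(topic, "user1", "view")
print(p1 == p2, o1, o2)                  # True 0 1
```

Ordering is guaranteed only within a partition: the two messages for "user1" land in the same partition with consecutive offsets.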
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 436
Kafka Partition
• Consumer group A has two consumers to read data from four partitions.
• Consumer group B has four consumers to read data from four partitions.
Kafka Cluster
Server 1 Server 2
P0 P3 P1 P2
C1 C2 C3 C4 C5 C6
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 437
Kafka Partition Offset
• The location of a message in a log file is called offset, which is a long integer that uniquely
identifies a message. Consumers use offsets, partitions, and topics to track records.
Consumer
group C1
Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12
Partition 1 0 1 2 3 4 5 6 7 8 9 Writes
Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12
Old New
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 438
Kafka Partition Replica (1)
Kafka Cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 439
Kafka Partition Replica (2)
The follower pulls data from the leader using the ReplicaFetcherThread.
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7
writes
old new old new
Producer
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 440
Kafka Partition Replica (3)
ReplicaFetcherThread
… … …
ReplicaFetcherThread ReplicaFetcherThread-1
… … Follower
Partition-1
ReplicaFetcherThread-2
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 441
Kafka Logs (1)
• A large file in a partition is split into multiple small segments. These segments facilitate
periodical clearing or deletion of consumed files to reduce disk usage.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 442
Kafka Logs (2)
segment file 1
msg-00000000000
in-memory index
msg-00000000215
delete msg-00000000000
……
msg-00014517018
msg-00030706778 msg-00014516809
reads
……
append msg-02050706778
segment file N
msg-02050706778
msg-02050706945
……
msg-02614516809
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 443
Kafka Logs (3)
00000000000000368769.log
Message368770 0
00000000000000368769.index Message368771 139
1,0 Message368772 497
3,497 Message368773 830
6,1407 Message368774 1262
8,1686 Message368775 1407
… Message368776 1508
Message368777 1686
N,position
…
Message368769+N position
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 444
Kafka Log Cleanup (1)
• Log cleanup modes: delete and compact.
• Threshold for deleting logs: retention time limit and size of all logs in a partition.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 445
Kafka Log Cleanup (2)
Offset 0 1 2 3 4 5 6 7 8 9 10
Log
Key K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 Before
Compaction
Value V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
Compaction
3 4 6 8 9 10
Keys K1 K3 K4 K5 K2 K6 Log
After
Values V4 V5 V7 V9 V10 V11 Compaction
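The compaction shown above can be reproduced with a short sketch: for every key, only the record with the highest offset survives.

```python
# Sketch of the compact cleanup policy: keep the latest record per key.

def compact(log):
    """log: list of (offset, key, value) tuples; returns the compacted log."""
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)   # later offsets overwrite earlier ones
    return sorted(latest.values())           # compacted log, still offset-ordered

# The log from the figure: offsets 0..10 with keys K1..K6 and values V1..V11
log = [(0, "K1", "V1"), (1, "K2", "V2"), (2, "K1", "V3"), (3, "K1", "V4"),
       (4, "K3", "V5"), (5, "K2", "V6"), (6, "K4", "V7"), (7, "K5", "V8"),
       (8, "K5", "V9"), (9, "K2", "V10"), (10, "K6", "V11")]
print(compact(log))
```

The result keeps exactly the offsets 3, 4, 6, 8, 9, and 10 from the figure, with values V4, V5, V7, V9, V10, and V11.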
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 446
Kafka Data Reliability
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 447
Message Delivery Semantics
At Most Once
• Messages may be lost.
• Messages are never redelivered or reprocessed.
At Least Once
• Messages are never lost.
• Messages may be redelivered and reprocessed.
Exactly Once
• Messages are never lost.
• Messages are processed only once.
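At-least-once delivery can be turned into effectively exactly-once processing by deduplicating on a message ID at the consumer, a common pattern sketched here (simplified; the message IDs and function name are illustrative):

```python
# Sketch: an idempotent consumer that skips redelivered messages.

def process_all(deliveries):
    """deliveries: list of (msg_id, payload); returns payloads processed once."""
    seen, results = set(), []
    for msg_id, payload in deliveries:
        if msg_id in seen:
            continue              # duplicate redelivery: skip reprocessing
        seen.add(msg_id)
        results.append(payload)   # process the message exactly once
    return results

# Message 2 is redelivered after a timeout, but is processed only once
print(process_all([(1, "a"), (2, "b"), (2, "b"), (3, "c")]))  # ['a', 'b', 'c']
```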
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 448
Kafka Message Delivery
Asynchronous replication (leader only): messages may be lost or repeated, giving at-most-once semantics.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 449
Kafka Cluster Mirroring
ZooKeeper
ZooKeeper
Target Cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 450
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 451
Write Data by Producer
Data Create
Data
Message
Publish
Message
Producer
Message
Kafka Cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 452
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 453
Read Data by Consumer
Message
Kafka Cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 454
Summary
This module describes the following information about Kafka: basic concepts and application
scenarios, system architecture and key processes.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 455
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 456
Quiz
A. HDFS.
B. ZooKeeper.
C. HBase.
D. Spark.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 457
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 458
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 459
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 463
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 464
ZooKeeper Overview
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 465
ZooKeeper Overview
Hadoop Ecosystem
(Diagram: the Hadoop ecosystem, with ZooKeeper providing the coordination service alongside Sqoop for data transfer, Hive with the HiveQL query language, Pig for scripting, Mahout for machine learning, a column datastore for unstructured data, and streaming, all built on core Hadoop and MapReduce.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 466
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 467
Position of ZooKeeper in FusionInsight
Application service layer
Open API / SDK REST / SNMP / Syslog
System
Hadoop API Plugin API management
Service
Hive M/R Spark Streaming Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase
Based on the open source Apache ZooKeeper, ZooKeeper provides services for
upper-layer components and is designed to resolve data management issues in
distributed applications.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 468
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 469
ZooKeeper Service Architecture - Model
ZooKeeper Service
Leader
• A ZooKeeper cluster is a group of servers. In this group, one server functions as the leader and the other servers are followers.
• ZooKeeper selects a server as the leader upon startup.
• ZooKeeper uses a user-defined protocol named ZooKeeper Atomic Broadcast (ZAB), which ensures the data consistency among nodes in the system.
• After receiving a data change request, the leader first writes the change to the local disk for recovery and then applies it to memory.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 470
ZooKeeper Service Architecture - Disaster
Recovery (DR)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 471
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 472
Key Features of ZooKeeper
• Real-time capability: Clients can obtain server updates and failures within a
specified period of time.
• Reliability: A message will be received by all servers.
• Wait-free: Slow or faulty clients cannot interfere with the requests of fast clients, so that
the requests of each client are processed effectively.
• Sequence: The sequence of data status updates on clients is consistent with that of
request sending.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 473
Read Process of ZooKeeper
• ZooKeeper consistency means that all servers present the same view to connected clients.
Therefore, a client can perform read operations against any server.
Client
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 474
Write Process of ZooKeeper
1. The client sends a write request to a server, for example, follower ZK 1.
2. If that server is not the leader, it forwards the write request to the leader (ZK 2).
3. The leader broadcasts a proposal for the change to all followers (3.1-3.3: Send Proposal).
4. Each follower persists the proposal and acknowledges it to the leader (4.1-4.3: ACK).
5. After a majority of servers acknowledge, the leader commits the change and notifies the followers (5.1-5.3: Commit); each server applies the change to its local storage.
6. The server connected to the client returns the write response.
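The proposal/ACK/commit exchange above hinges on a majority quorum. The following is a conceptual sketch only, not the Apache ZooKeeper implementation; the class and method names are invented for illustration:

```java
// Conceptual sketch of the ZAB quorum rule: the leader commits a proposal
// once a strict majority of the ensemble (leader included) has acknowledged
// it. This is why a 5-node ensemble tolerates 2 server failures.
public class ZabQuorumDemo {
    // True when the number of ACKs reaches a strict majority of the ensemble.
    static boolean canCommit(int ensembleSize, int acks) {
        return acks > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensemble = 5; // one leader plus four followers
        System.out.println(canCommit(ensemble, 2)); // false: no majority yet
        System.out.println(canCommit(ensemble, 3)); // true: majority reached, commit
    }
}
```

This also explains why ZooKeeper ensembles are usually deployed with an odd number of servers: a 6-node ensemble tolerates no more failures than a 5-node one.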
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 475
ACL (Access Control List)
The access control list (ACL) controls access to ZooKeeper. An ACL applies only to the specified znode and
is not inherited by the subnodes of that znode. Run the setAcl /znode scheme:id:perm command to
set an ACL.
scheme indicates the authentication mode. ZooKeeper provides four authentication modes:
• world: has a single ID, anyone; any user can access ZooKeeper.
• auth: does not use any ID; it stands for any already-authenticated user.
• digest: uses the hash generated from username:password (SHA-1, Base64-encoded, in current ZooKeeper versions) as the authentication ID.
• ip: uses the client host IP address for authentication.
id identifies the user to be authenticated; its format varies with the scheme.
perm indicates the permissions that a user who passes ACL authentication has on the
znode.
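As an illustration (not part of the official material), the digest-scheme ID can be derived from a "username:password" string the same way ZooKeeper's DigestAuthenticationProvider does, as Base64(SHA-1(user:pass)); the class and method names below are invented:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

// Computes the ID used in a digest-scheme ACL entry such as
// digest:user:<id>:cdrwa (c=create, d=delete, r=read, w=write, a=admin).
public class DigestIdDemo {
    static String digestId(String userColonPass) throws Exception {
        byte[] sha = MessageDigest.getInstance("SHA-1")
                .digest(userColonPass.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(sha);
    }

    public static void main(String[] args) throws Exception {
        // Prints the ID portion that would follow "digest:user:" in the ACL.
        System.out.println("user:" + digestId("user:password"));
    }
}
```

The resulting string is what appears in the id field when you run getAcl on a znode protected by the digest scheme.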
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 476
Log Enhancement
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 477
Commands for ZooKeeper Clients
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 478
01 Introduction to ZooKeeper
02 Position of ZooKeeper in FusionInsight
03 System Architecture
04 Key Features
05 Relationship with Other Components
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 479
ZooKeeper and Streaming
ZooKeeper cluster
Streaming cluster
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 480
ZooKeeper and HDFS
[Figure: the ZKFC processes monitor the active and standby NameNodes and use the ZooKeeper cluster for active/standby election and failover.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 481
ZooKeeper and YARN
[Figure: the ResourceManager that first creates the Statestore directory in ZooKeeper and writes its election message becomes active; the standby ResourceManagers monitor the active's election message and read state from the Statestore directory.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 482
ZooKeeper and HBase
ZooKeeper
Cluster
Monitoring
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 483
Summary
This module describes the following information about ZooKeeper:
• Functions and position in FusionInsight.
• Service architecture and data models.
• Read and write processes as well as consistency.
• Creation and permission settings of ZooKeeper nodes.
• Relationship with other components.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 484
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 485
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 486
THANK YOU!
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 490
01 FusionInsight Overview
02 FusionInsight Features
03 Success Cases of FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 491
Apache Hadoop - Prosperous Open-Source Ecosystem
Hadoop Ecosystem Map
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 492
Big Data Is an Important Pillar for Huawei ICT Strategy
[Figure: Huawei strategy map, the global distribution of Huawei's big data R&D teams, and the partner ecosystem of content and app providers, third-party ISVs, and professional services.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 493
FusionInsight HD: From Open-Source to Enterprise Versions
[Figure: evolution from the initial open-source Hadoop/HBase, through a prosperous community, to the enterprise version, adding security, reliability, ease of use, performance optimization, version mapping, baseline and patch selection, configuration, and log enhancements.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 494
FusionInsight Platform Architecture
[Figure: FusionInsight platform architecture.]
• Industry applications: safe city, power industry, financial industry, telecom, and big data cloud services.
• FusionInsight Porter (data integration): Sqoop for batch collection, Flume for real-time collection.
• FusionInsight Miner (data insight): Weaver graph analysis engine, Miner Studio mining platform.
• FusionInsight Farmer (data intelligence): RTD real-time decision engine, Farmer Base reasoning framework.
• FusionInsight HD (data processing): Spark one-stop analysis framework, Storm / Flink stream processing framework, HBase NoSQL database, Kafka message queue, Yarn resource management, Oozie job scheduling, ZooKeeper collaboration service, CarbonData new file format, and HDFS distributed file system.
• FusionInsight Elk: standard SQL engine. FusionInsight LibrA: parallel database.
• FusionInsight Manager (management platform): security, performance, fault, tenant, and configuration management.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 495
Contribution to the Open-Source Community
Open-source capability maturity, from low to high:
1. Locate peripheral problems.
2. Resolve kernel-level problems (outstanding individuals).
3. Resolve kernel-level problems as a team.
4. Perform kernel-level development to support key service features.
5. Lead the community to complete future-oriented kernel-level feature development.
6. Create top community projects and be recognized by the ecosystem.
• Outstanding product development and delivery capabilities and carrier-class operation support capabilities, empowered by the Hadoop kernel team.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 496
01 FusionInsight Overview
02 FusionInsight Features
03 Success Cases of FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 497
System and Data Reliability
System Reliability
• All components without SPOF.
• HA for all management nodes.
• Software and hardware health status monitoring.
• Network plane isolation.
Data Reliability
• Cross-data center DR.
• Third-party backup system integration.
• Key data power-off protection.
• Hot-swappable hard disks.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 498
Security
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 500
Network Security and Reliability - Dual-Plane Networking
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 501
Visualized Cluster Management, Simplifying O&M
[Figure: cluster health dashboard with node states Good, Bad, and Unknown.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 502
Graphical Health Check Tool (1)
[Figure: health check report; 72% of check items passed, 28% failed.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 503
Graphical Health Check Tool (2)
[Figure: graphical health check results for a FusionInsight cluster.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 504
Easy Development
Native APIs of HBase:

```java
try {
    table = new HTable(conf, TABLE);
    // 1. Generate RowKey.
    {......}
    // 2. Create Put instance.
    Put put = new Put(rowKey);
    // 3. Convert columns into qualifiers (need to consider merging cold columns).
    // 3.1. Add hot columns.
    put.add(COLUMN_FAMILY, Bytes.toBytes("QA"), hotCol);
    // 3.2. Merge cold columns.
    {.......}
    // 3.3. Add cold columns.
    put.add(COLUMN_FAMILY, Bytes.toBytes("QB"), coldCols);
```

Enhanced APIs (CTBase SDK):

```java
try {
    table = new ClusterTable(conf, CLUSTER_TABLE);
    // 1. Create CTRow instance.
    CTRow row = new CTRow();
    // 2. Add columns.
    {........}
    // 3. Put into HBase.
    table.put(TABLE, row);
} catch (IOException e) {
    // No need to handle connection re-creation.
}
```

The enhanced HBase SDK adds a recoverable connection, a schema manager, table design, and data tools on top of HBase.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 505
FusionInsight Spark SQL
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 506
Spark SQL Multi-Tenant
[Figure: Beeline and JDBC clients connect to Spark JDBC Proxy instances (Proxy 1, Proxy 2, ... Proxy X), which route requests to the Spark JDBCServer instances running in each tenant's Yarn queue (YarnQuery Tenant A, YarnQuery Tenant B).]
• The community's Spark JDBCServer supports only a single tenant; a tenant is bound to one Yarn resource queue.
• The FusionInsight Spark JDBCServer supports multiple tenants, and resources are isolated among
different tenants.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 507
Spark SQL Small File Optimization
[Figure: many 1 MB files in HDFS map to many small RDD partitions; the optimization merges them to reduce small-file overhead.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 508
Apache CarbonData - Converging Data Formats of Data Warehouse (1)
CarbonData: a single file format that meets the requirements of different access types (MapReduce, Spark, and Flink) on one storage layer.
• OLAP / interactive query (multidimensional analysis): 20 to 33 times faster.
• Sequential access (large-scale scanning): up to 688 times faster.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 509
Apache CarbonData - Converging Data Formats of Data Warehouse (2)
• Apache Incubator project since June 2016.
• Apache releases: 4 stable releases; latest 1.0.0, Jan 28, 2017.
• Compute engines: Apache Spark, Flink, Hive.
• Contributors:
CarbonData supports IUD statements and provides data update and deletion capabilities in
big data scenarios. Pre-generated dictionaries and batch sort improve CarbonData import
efficiency while global sort improves query efficiency and concurrency.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 510
CarbonData Enhancement
[Figure: Thrift Server and Spark SQL run on Spark, accessing the CARBON file format alongside other data sources.]
• Quick query response: CarbonData features high-performance queries, about ten times faster than plain
Spark SQL. Its dedicated data format is designed for high-performance queries, including multiple index
technologies, global dictionary encoding, and multiple pushdown optimizations, enabling quick responses
to TB-level data queries.
• Efficient data compression: CarbonData compresses data by combining lightweight and heavyweight
compression algorithms, saving 60% to 80% of storage space and significantly reducing hardware storage
costs.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 511
Flink - Distributed Real-Time Processing System
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 512
Visible HBase Modeling
[Figure: mapping a user list to an HBase table. Each column of the service data (each attribute) maps to an HBase column qualifier, and each column value is stored as a KeyValue.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 513
HBase Cold Field Merging Transparent to Applications
[Figure: user data columns (ID, Name, Phone, ColA-ColH) with the cold columns merged into a few HBase KeyValues (A, B, C, D).]
Problems
• High expansion rate and poor data query performance as the number of HBase columns increases.
• Increased development complexity and metadata maintenance when the application layer merges
cold data columns itself.
Features
• Cold field merging transparent to applications.
• Real-time write and batch import interfaces.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 514
Hive / HBase Fine-Grained Encryption
[Figure: sensitive data is encrypted on write and decrypted on read in Hive / HBase, while insensitive data is stored as-is.]
Application scenarios
• Data saved in plaintext may cause security risks of sensitive data leakage.
Solution
• Hive: encryption of tables and columns.
• HBase: encryption of tables, column families, and columns.
• AES and SM4 encryption algorithms, as well as user-defined encryption algorithms.
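Conceptually, column-level encryption means the platform encrypts sensitive column values before they reach storage and decrypts them on read. FusionInsight does this transparently inside Hive/HBase; the standalone AES round trip below (JDK defaults, fine for a demo but not a production cipher configuration) only illustrates the idea, and all names are invented:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Minimal sketch: encrypt a sensitive column value (e.g., a phone number)
// before storage and decrypt it on read.
public class ColumnEncryptionDemo {
    static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        return kg.generateKey();
    }

    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plain);
    }

    static byte[] decrypt(SecretKey key, byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(cipherText);
    }
}
```

A read of the encrypted column without the key yields only ciphertext, which is what removes the plaintext-leakage risk described above.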
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 515
HBase Secondary Indexing
• No index: "Scan + Filter" must scan a large amount of data.
• Secondary index: the target data can be located with two I/O operations (one index read, then one data read).
• Index regions and data regions are companions under a unified processing mechanism.
• Original HBase API interfaces are retained, user-friendly.
• Coprocessor-based plug-in, easy to upgrade.
• Write optimization, supporting real-time writes.
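A toy illustration of the difference, using plain Java collections rather than the HBase coprocessor implementation (all names are invented): without an index, finding rows by a column value scans every entry; a secondary index maps value to row keys, so a lookup needs only the index read plus the data read.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// data: rowKey -> column value; index: column value -> row keys.
public class SecondaryIndexDemo {
    // "Scan + Filter": visit every row and test the predicate.
    static List<String> scanFilter(Map<String, String> data, String wanted) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : data.entrySet()) { // full scan
            if (e.getValue().equals(wanted)) hits.add(e.getKey());
        }
        return hits;
    }

    // Secondary index: one index read locates the matching row keys directly.
    static List<String> indexLookup(Map<String, List<String>> index, String wanted) {
        return index.getOrDefault(wanted, Collections.emptyList());
    }
}
```

Both calls return the same row keys, but the scan cost grows with the table size while the index lookup cost does not.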
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 516
CTBase Simplifies HBase Multi-Table Service Development
[Figure: transaction tables modeled with CTBase.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 517
HFS Small File Storage and Retrieval Engine
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 518
Label-based Storage
• HDFS common storage: batch processing applications and online applications share nodes, and
I/O conflicts affect online services.
• HDFS label-based storage: the data of online applications is stored only on nodes labeled with
"Online Application" and is isolated from the data of offline applications. This design prevents I/O
competition and improves the local hit ratio.
• Solution description: Label cluster nodes based on applications or physical characteristics, for example, label a
node with “Online Application.” Then application data is stored only on nodes with specified labels.
• Application scenarios:
1. Online and offline applications share a cluster.
2. Specific services (such as online applications) run on specific nodes.
• Customer benefits:
1. I/Os of different applications are isolated to ensure the application SLA.
2. The system performance is improved by improving the hit ratio of application data.
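The placement rule can be sketched as a label filter over the cluster nodes. This is a rough illustration under invented names, not the actual HDFS label-expression API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Data tagged with a label may only be placed on nodes carrying that label.
public class LabelPlacementDemo {
    // Returns the (sorted) nodes eligible to store data with the given label.
    static List<String> candidates(Map<String, Set<String>> nodeLabels, String required) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : nodeLabels.entrySet()) {
            if (e.getValue().contains(required)) out.add(e.getKey());
        }
        Collections.sort(out);
        return out;
    }
}
```

Because "Online Application" data and batch data resolve to disjoint candidate sets, their I/O paths never meet on the same disks, which is exactly the isolation benefit described above.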
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 519
Label-based Scheduling
[Figure: with common scheduling, tasks land on arbitrary nodes; with label-based scheduling, tasks that need large memory run only on nodes labeled "Large memory," and other tasks run on "Default" nodes.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 520
CPU Resource Configuration Period Adjustment
[Figure: the CPU shares of batch processing and real-time applications switch between the 07:00-20:00 and 20:00-07:00 time segments.]
• Solution description: Different services receive different proportions of resources in different time segments. For
example, from 07:00 to 20:00, real-time services at peak hours can be allocated 60% of resources; from 20:00 to
07:00, when real-time services are off-peak, 80% of resources can be allocated to the batch processing
applications.
• Application scenario: The peak hours and off-peak hours of different services are different.
• Customer benefit: Services can obtain as many resources as possible at peak hours, boosting the average resource
utilization of the system.
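The time-segment rule in the example above amounts to a simple lookup from the hour of day to a resource share. The numbers mirror the slide's example; the class and method names are invented:

```java
// Maps the hour of day to the CPU share (percent) granted to real-time
// applications; batch processing applications receive the remainder.
public class ResourceScheduleDemo {
    // 07:00-20:00 (peak): real-time gets 60%. 20:00-07:00 (off-peak):
    // batch gets 80%, so real-time gets 20%.
    static int realtimeShare(int hour) {
        return (hour >= 7 && hour < 20) ? 60 : 20;
    }
}
```

A scheduler applying this table at each period boundary gives each workload class the larger share exactly when it needs it, which is what raises the average utilization.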
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 521
Resource Distribution Monitoring
Remaining HDFS Capacity
[Figure: trend chart of the remaining HDFS capacity between 04-27 21:05:00 and 04-27 22:05:00, with a y-axis from about 1675 to 1705.]
Benefits
• Quick focusing on the most critical resource consumption.
• Quick locating of the node with the highest resource consumption to take appropriate measures.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 522
Dynamic Adjustment of the Log Level
• Application scenario: When a fault occurs in the Hadoop cluster, locating it quickly requires changing the log level,
but changing the log level traditionally requires restarting the process, which interrupts services.
How can this problem be resolved?
• Solution: Dynamically adjust the log level on the web UI.
• Benefits: When locating a fault, you can quickly change the log level of a specified service or node without
restarting the process or interrupting services.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 523
Wizard-based Cluster Data Backup
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 36
Wizard-based Cluster Data Restoration
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 37
Multi-Tenant Management
[Figure: Dept. A maps to Tenant A, and sub-department A_1 maps to tenant A_1.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 526
One-Stop Tenant Management
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 39
Visualized, Centralized User Rights Management
Visualized, centralized user rights management is easy to use, flexible, and refined:
• Easy to use: visualized multi-component unified user rights management.
• Flexible: role-based access control (RBAC) and predefined privilege sets (roles) that can be reused.
• Refined: multi-level (database / table / column-level) and fine-grained (Select / Delete / Update / Insert / Grant)
authorization.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 528
Automatic NTP Configuration
NTP Client
Management Management
Node (Active) Node (Standby)
NTP Server NTP Client
NTP Client NTP Client NTP Client NTP Client NTP Client
DataNode DataNode DataNode DataNode Control Node
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 529
Automatically Configuring Mapping of Hosts
Benefits
• Shorten environment preparation time to install
the Hadoop cluster.
• Reduce probability of user configuration errors.
• Reduce the risk of manually configuring mapping
for stable running nodes after capacity expansion
in a large-scale cluster.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 530
Rolling Restart / Upgrade / Patch
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 531
01 FusionInsight Overview
02 FusionInsight Features
03 Success Cases of FusionInsight
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 532
Huawei Smart Transportation Solution
Secure
• Challenges to key vehicle identification: insufficient automatic identification capability for key vehicles.
• Insufficient traffic accident detection capability: blind spots, weak detection technology, and manual
accident reporting and handling.
• Low efficiency of special crackdowns: information fragmentation and a poor special-crackdown platform.
Organized
• Challenges to checkpoint and e-police capabilities: rigid algorithms.
• Challenges to violation review and handling capabilities: heavy workload.
• Challenges to special-crackdown data analysis capabilities: manual analysis taking 7-30 days.
Smooth
• Challenges to traffic detection capability: faulty detection devices, low detection efficiency, and unreliable
detection results.
• Challenges to traffic analysis capabilities: traffic information not shared among cities.
• Challenges to traffic signal optimization.
Intelligent
• Computing intelligence challenges: closed systems and technologies, and fragmented information.
• Perceptual intelligence challenges: weak awareness of traffic, events, and violations.
• Cognitive intelligence challenges: lack of traffic awareness in regions and at intersections.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 533
Traffic Awareness in the Whole City: Deep Learning and
Digital Transformation
• Without adding any cameras, deep learning and intelligent analysis generate about 50 billion real-time road traffic
parameters every month, laying a foundation for the digital transformation of traffic.
[Figure: traffic accident perception and analysis; traffic signal optimization.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 534
Traffic Big Data Analysis Platform
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 535
Limitations of Traditional Marketing Systems
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 536
Marketing System Architecture
[Figure: marketing system architecture.]
• Big data platform: Flume, Spark, Loader, Hive, Farmer, MPPDB, Storm / Flink, HBase, MapReduce,
Kafka, Redis, HDFS / Yarn, RTD, MQ, ZooKeeper, and Manager.
• Infrastructure / cloud platform: x86 servers, network devices, and security devices.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 537
Big Data Analysis, Mining, and Machine Learning Make
Marketing More Accurate
[Figure: closed-loop precision marketing flow: data sources and customer data → customer group filtering and correlation analysis → marketing activity plan → marketing activities over multiple channels (SMS, app, Twitter) → analysis report → effect evaluation and continuous optimization.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 538
Solution Benefits
• Precise: precise customer group mining.
• Easy to use: self-learning of rules.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 539
A Carrier: Big Data Convergence to Achieve Big Values
[Figure: converged big data platform for a carrier: data sources feed an ETL layer into HDFS and HBase, managed by Yarn / ZooKeeper and FusionInsight Manager.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 540
Philippine PLDT: Converting and Archiving Massive
CDRs
Report / interactive analysis / forecast analysis / text mining (CSP, Data Federation).
Periodically obtain the source files from the transit server, convert the files to the T0 / T1
format, and upload the converted files to the CSSD / DWH server.
Structured data sources (SUN, NSN E / / /, PLP, ODS, ..., AURA) and unstructured data sources
(mobile Internet, social media, voice to text).
Hadoop stores original CDRs as well as structured and unstructured data, improving storage capacity
and processing performance while reducing hardware costs.
A total of 1.1 billion records (664,300 MB) were extracted, converted, and loaded at an overall
processing speed of 113 MB/s, far higher than the 11 MB/s expected by the customer.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 541
Summary
These slides describe the enterprise edition of Huawei FusionInsight HD, focusing on FusionInsight HD features
and application scenarios, and present Huawei FusionInsight HD success cases in the industry.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 542
Quiz
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 543
Quiz
True or False
• Hive supports encryption of tables and columns. HBase supports encryption of tables,
column families, and columns. (T or F).
• User rights management is role-based access control and provides visualized and unified
user rights management for multiple components. (T or F).
Multiple-Answer Question
• Which of the following indicate the high reliability of FusionInsight HD? ( )
A. All components are free of SPOFs.
B. All management nodes support HA.
C. Health status monitoring for the software and hardware.
D. Network plane isolation.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 544
More Information
• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 545
THANK YOU!