HCIA-Big Data V2.0 Training Material


HCIA-Big Data V2.0
Training Material

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


• Chapter 1 Big Data Industry and Technological Trends∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙4
• Chapter 2 HDFS - Hadoop Distributed File System∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙ 55
• Chapter 3 MapReduce - Distributed Offline Batch Processing and YARN - Resource Negotiator∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙88
• Chapter 4 Spark2x - In-memory Distributed Computing Engine∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙133
• Chapter 5 HBase - Distributed NoSQL Database∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙180
• Chapter 6 Hive - Distributed Data Warehouse∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙236
• Chapter 7 Streaming - Distributed Stream Computing Engine∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙276
• Chapter 8 Flink - Stream Processing and Batch Processing Platform∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙312
• Chapter 9 Loader - Data Transformation∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙352
• Chapter 10 Flume - Massive Logs Aggregation∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙382
• Chapter 11 Kafka - Distributed Message Subscription System∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙425
• Chapter 12 Zookeeper - Cluster Distributed Coordination Service∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙461
• Chapter 13 FusionInsight HD Solution Overview∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙488

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 1
Big Data Industry and Technological Trends

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
After completing this course, you will be able to understand:
A. What big data is
B. Big data technological trends and applications
C. Huawei big data solution
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 3
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 4
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 5
Big Data as a National Strategy Around the World

• G8: The Group of Eight (G8) released the G8 Open Data Charter, proposing to accelerate the implementation of data openness and usability.
• EU: The European Union (EU) promotes the Data Value Chain to transform the traditional governance model, reduce the costs of common departments, and accelerate economic and employment growth with big data.
• Japan: The Abe Cabinet announced the Declaration to Be the World's Most Advanced IT Nation, which plans to develop Japan's national IT strategy with open public data and big data as its core.
• UK: The UK Government released the Capacity Development Strategy, which aims to use data to generate business value and boost economic growth, and undertakes to open the core databases in the transportation, weather, and medical treatment fields.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 6
Implementing the National Big Data Strategy

Implementing the national big data strategy to accelerate the construction of a
"Digital China" involves five tasks, which are summarized as follows:

• Promote the innovation and development of big data technology.

• ​Build a digital economy with data as a key enabler.

• Improve the country's capability in governance with big data.

• Improve people's livelihood by applying big data.

• Protect national data security.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 7
Big Data Era

Definition from Wikipedia:

• Big data is data sets that are so voluminous and complex that traditional data-processing application
software is inadequate to deal with them.

The 4 V's:
01 Huge amount of data (Volume)
02 Various types of data (Variety)
03 High data processing speed (Velocity)
04 Low data value density (Value)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 8
Source of Big Data

Social data:
• There are more than 200 million messages every day.
• There are more than 300 million active users every day.
• Facebook: 50 TB of log data is generated each day, with over 100 TB of analysis data derived.

Machine data:
• There are 2.8 billion smartphone users worldwide.
• Hundreds of millions of devices that support the Global Positioning System (GPS) are sold each year.
• CERN: Experiments at CERN generate an entire petabyte (PB) of data every second as particles are fired around the Large Hadron Collider (LHC).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 9
All Businesses Are Data Businesses

• Your business is a data business now.
• Data about your customers is as valuable as your customers.
• Keep data moving.
• Streaming data is business opportunity.
• Data as a Platform (DaaP).
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 11
Differences Between Data Processing in the Big Data Era and
Traditional Data Processing

From databases (DBs) to big data (BD):

• "Pond fishing" vs. "sea fishing": the "fishes" represent the data to be processed.

Database vs. Big Data:
• Data scale: Small (in MB) vs. large (in GB, TB, or PB).
• Data type: Single (mostly structured) vs. diversified (structured, semi-structured, or unstructured).
• Relationship between modes and data: Modes come ahead of data (ponds come ahead of fishes) vs. data comes ahead of modes, with modes evolving constantly as data increases.
• Object to be processed: Data (fishes in ponds) vs. data relationships (using certain fishes to determine whether other types of fishes exist).
• Processing tool: One size fits all vs. no size fits all.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 12
Big Data Era

China's netizens rank first in the world, and the data volume they generate each day also surpasses
that of any other country.

• Taobao website: More than 50,000 GB of data is generated per day; the data storage volume is 40 million GB.
• A camera working at a rate of 8 Mbit/s: 3.6 GB of data is generated per hour; tens of millions of GB of data can be generated each month in one city.
• Baidu: 1 billion GB of data in total; 1 trillion web pages stored; about 6 billion search requests processed each day.
• Hospitals: The CT image data generated for one patient reaches dozens of GB; tens of billions of GB of data needs to be stored each year in a country.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 14
Big Data Era

• Decrease of hardware costs.

• Acceleration of network bandwidth.

• Emergence of cloud computing.

• Popularization of intelligent terminals.

• E-commerce and social networks.

• Comprehensive application of electronic maps.

• Internet of Things (IoT).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 15
Relationship Between Big Data and People

If all you have is a hammer, everything looks like a nail.

Today, big data can seem miraculous and omnipotent. However, do not
take big data as an all-round way to solve every problem in the world.
Human thought, personal culture and behavior, and the existence and
development of nations and societies are complicated, intricate, and
unique. Computers cannot simply let the numbers speak for themselves;
no matter when, it is people who are speaking and thinking.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 16
What Big Data Cannot Do

01 Substitute managers' decision-making capabilities
• Big data is not only a technical issue, but also a decision-making issue.
• Data onboarding must be pushed and determined by top leaders.

02 Substitute effective business models
• Balance cannot always be obtained with big data.
• Business models are paramount. We must figure out how to obtain profits in advance.

03 Discover knowledge aimlessly
• Data mining must be carried out with restrictions and targets. Otherwise, it is futile.

04 Substitute the role of experts
• Experts contribute greatly to product modeling, as in IBM Deep Blue and AlphaGo.
• The role of experts may decrease over time. However, experts play a major role in the initial stage.

05 Build a model for permanent benefits
• Big data requires "live" data (with feedback).
• Models need to be updated through lifelong machine learning.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 17
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 18
Big Data Era Leading the Future

Data has penetrated into every industry and business domain.

• Discerning essences (services), forecasting trends, and guiding the future are at the core of the big data era.
• Guide today's efforts with a clear future target, and make due efforts now to secure future success.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 19
Big Data Application Scope

Proportion of Top 100 industries using big data

[Pie chart: finance, city, medical treatment, sports, education, telecom, retail, and others; individual shares range from 4% to 24%.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 20
Big Data Application: Politics

Big data psychological analysis helped Trump win the U.S. presidential election.

• Donald Trump employed Cambridge Analytica Ltd (CA) to perform personality and requirement analysis on American voters, acquiring the personality profiles of 220 million Americans. (Figure: "the cave", CA's data analysis center.)
• CA used voters' "likes" on Facebook to analyze their personality traits and political orientation, classified voters into three types (Republican supporters, Democratic supporters, and swing voters), and focused on attracting swing voters.
• Trump had never sent emails before; he bought his first smartphone after the presidential election and became fascinated with Twitter. The messages he sent on Twitter were data-driven and varied for different voters.
• African American voters were shown a video in which Hillary Clinton referred to black people as "predators", steering them away from her ballot box. These "dark posts" were visible only to specified users.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 21
Big Data Application: Finance

Traditional customers:
• Obtain services at fixed times and places.
• Passively receive data.
• Trust market information.
• Passively receive information propagation.

Omni-channel customers:
• Obtain services anytime and anywhere.
• Analyze and create data.
• Seek meaningful experience and review details.
• Involve themselves in creating content, products, and experience.

Traditional finance:
• Offers standard industrial services.
• Focuses on processes and procedures.
• Passively receives information from a single source.
• Contacts customers through customer managers; interacts in fixed channels and in inflexible ways.

New financial institutions:
• Attach importance to new data mining.
• Operate customers and focus on scenarios.
• Improve merchandise efficiency.
• Serve customers with flexible, personalized services.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 22
Big Data Application: Finance Case

• There is a three-hour time difference between the east and west coasts of the USA.
• Walmart uses the sales analysis results of the east coast to guide the goods arrangement of the west coast on the same day.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 23
Big Data Application: Education
Now, big data analysis has been applied to American public education and has become an important
force of education reform.

Big data in school education correlates in-class behavior with outcomes.

Behavior analyzed:
• Average time for answering each question
• Sequence of question-answering in exams
• Duration and frequency of interaction with teachers
• Duration and correctness of answering questions
• Question-answering times
• Hand-raising times in class
• Homework correctness

Outcomes tracked:
• Academic performance
• Enrollment rate
• Dropout rate
• Rate of admission into higher schools
• Literacy accuracy

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 24
Big Data Application: Transportation
Most people may choose railway for a distance less than 500 km, but...

• Example route for a distance less than 500 km: Beijing to Taiyuan. (The map also marks 500 km radii around Beijing, Shanghai, Chengdu, and Guangzhou.)
• Modes of transportation during the 2018 Chinese Spring Festival, compared by price and time: plane, train, vehicle rental, and long-distance coach.
• The train has the highest performance-price ratio but tickets are hard to obtain; the long-distance coach is more cost-effective than vehicle rental, whose performance-price ratio is inferior to taking the train.
• For a 500 km (about six-hour) driving distance, railway has the highest performance-price ratio, but the chance of buying tickets depends on luck, and the performance-price ratio of vehicle rental is inferior to taking the train. According to a survey, when they fail to get train tickets, more than 70% of people will rent a vehicle to go home.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 25
Big Data Application: Tourism

Island Travel Preference During China's National Day Holiday

[Pie chart: Phuket 29%, Koh Samui 18%, Bali 12%, Kuala Lumpur 11%, Okinawa 9%, Manila 5%, Colombo 5%, Jakarta 5%, Jeju 3%, Honolulu 3%.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 26
Big Data Application: Tourism

[Bar chart: air ticket price forecast (0-6000) for Honolulu, Colombo, Bali, Okinawa, Jeju, Phuket, Jakarta, Manila, Koh Samui, and Kuala Lumpur.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 27
Big Data Application: Government and Public
Security
Public security scenario: automatic warning and response.

• Automatic warning system: area-based people flow thresholds are monitored (for example, more than 10,000 people at city level, or more than 2,000 people for a community area). Sample alarm: the number of people on the right side of Beijing Olympic Forest Park exceeds the threshold as a crowd gathers to watch an affray.
• The city or community monitoring system delivers the issue to transaction processing departments for confirmation and reports it to upper-level departments; the supervision department locates the issue in real time at its initial stage.
• Big data analysis can monitor and analyze population flow into cities, warning of abnormal increases in people flow.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 28
Big Data Application: Traffic Planning
Traffic planning scenario: multi-dimensional analysis of traffic crowds.

Areas where people flow once exceeded the threshold:
• North gate of Beijing Workers' Gymnasium: more than 500 people per hour.
• Sanlitun: more than 800 people per hour.
• Beijing Workers' Gymnasium: more than 1,500 people per hour.

[Charts: crowd analysis by age proportion (younger than 20, 20-30, 30-40, older than 50) and by travel mode (bus, metro, auto, others), supporting crowd-based traffic forecasts, road network planning, and bus line planning.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 29
Big Data Applications: Sports

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 30
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 31
Challenges of Traditional Data Processing Technologies

• Scalability: there is a gap between the scalability required for big data processing and hardware performance; traditional systems scale up, while big data workloads demand scale-out.

[Figure: a traditional framework of midrange computers (e.g., P595, P570, and P690) + disk arrays + commercial data warehouses (DB2, Oracle, Sybase, TD), with middleware (WebLogic 8.2 and Apache Tomcat 5.2, on Appframe/Spring) and portal, report, KPI, OLAP analysis, data mining, and data management applications on top.]

Its limitations:
• High cost for storing massive data.
• Insufficient batch data processing performance.
• Lack of streaming data processing.
• Limited scalability.
• Single data source; external value-added data assets remain untapped.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 32
Application Scenarios of the Enterprise
Big Data Platform

• Operation (telecom and finance; structured data): operation analysis, telecom signaling, financial subledger, financial bills, electricity distribution, smart grid.
• Management (finance; structured + semi-structured data): performance management, report analysis, history analysis, social security analysis, tax analysis, decision-making support and prediction.
• Supervision (government; structured + semi-structured data): public security network monitoring, technical investigation for national security, public opinion monitoring, China Banking Regulatory Commission (CBRC) inspection, food source tracing, environment monitoring.
• Profession (government; non-structured data): audio and video, seismic prospecting, weather nephogram, satellite remote sensing, radar data, IoT.

• With strong appeals for data analysis in telecom carriers, financial institutions, and governments, new
technologies have been adopted on the Internet to process big data of low value density.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 33
Challenges Faced by Enterprises (1)

Challenge 1: Business departments do not have clear


requirements on big data.

• Many enterprises' business departments are not familiar with big data, its application scenarios, or its benefits, so it is
difficult for them to provide accurate big data requirements. Because requirements are unclear and big data departments are
non-profit departments, enterprise decision-makers worry about a low input-output ratio and hesitate to build a big data
department; some even delete large amounts of valuable historical data because no application scenario exists for it.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 34
Challenges Faced by Enterprises (2)

Challenge 2: Serious data silo problems within


enterprises.
• The most important challenge enterprises face in implementing big data is data fragmentation. In large-scale enterprises,
different types of data are often scattered across different departments, so the same data cannot be shared within the
enterprise and the value of big data cannot be fully exploited.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 35
Challenges Faced by Enterprises (3)

Challenge 3: Low data availability and poor quality.

• Many large and medium-sized enterprises generate a large amount of data each day. However, some
enterprises pay no attention to big data preprocessing, resulting in nonstandard data processing.
During big data preprocessing, data needs to be extracted and converted into a form that is easy to
process, then cleaned and denoised to obtain valid data. According to data from Sybase, if high-
quality data availability improves by 10%, enterprise revenue will improve by more than 10%.

[Figure callouts: problem locating time decreased by 50%; manual checks reduced thanks to self-service problem detection; availability improved by 10%; service revenue improved by more than 20%; no manual participation required thanks to proactive problem detection; time spent identifying problems reduced by 90%.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 36
Challenges Faced by Enterprises (4)

Challenge 4: Data-related management technology


and architecture.

• Traditional databases cannot process hundreds of TB-scale data or above.


• Data diversities are not considered in traditional databases. In particular, the compatibility of structured data, semi-
structured data, and non-structured data is not considered.
• Traditional databases do not have high requirements on the data processing time. However, big data needs to be
processed in real time.
• O&M of massive data needs to ensure data stability, support high concurrency, and reduce the server load.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 37
Challenges Faced by Enterprises (5)

Challenge 5: Data security.

• Life lived online makes it easy for criminals to obtain personal information, and gives rise to new crime methods
that are difficult to track and prevent.
• How to ensure personal information security becomes an important subject in the big data era. In addition,
with the continuous increase of big data, requirements on the security of physical devices for storing data as
well as on the multi-copy and disaster recovery mechanism of data will become higher and higher.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 38
Challenges Faced by Enterprises (6)

Challenge 6: Insufficient big data talents.

• Each step of big data construction must be completed by professionals. Therefore, it is necessary to develop
and build a professional team that understands big data, knows administration, and has experience in big data
applications. Hundreds of thousands of big data-related jobs are added globally every year, and a talent gap of
more than 1 million will appear in the future. Therefore, universities and enterprises must make joint efforts to
develop and mine talent.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 39
Challenges Faced by Enterprises (7)

Challenge 7: Tradeoff between data openness and


privacy.
• Today, with the growing importance of big data applications, opening and sharing data resources has
become a key factor in maintaining advantage in the data wars. However, opening data will inevitably
infringe on the privacy of some users. How to effectively protect the privacy of citizens and enterprises,
and gradually strengthen privacy legislation while promoting all-round data opening, application, and
sharing, will be a major challenge in the big data era.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 40
From Batch Processing to Real-Time Analysis

• Hadoop is a basis for batch processing of big data, but Hadoop alone cannot provide real-time analysis.

Apache Hadoop ecosystem (bottom-up):
• HDFS: Hadoop Distributed File System.
• YARN / MapReduce v2: distributed processing framework.
• ZooKeeper: coordination.
• Flume: log collector.
• Sqoop: data exchange.
• HBase: columnar store.
• Hive: SQL query.
• Pig: scripting.
• Mahout: machine learning.
• R connectors: statistics.
• Oozie: workflow.
• Ambari: provisioning, managing, and monitoring Hadoop clusters.

• Real-time intelligentization of highly integrated, high-value information and knowledge is a main
commercial requirement.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 41
Hadoop Reference Practice in the Industry

Intel® Distribution for Apache Hadoop software, with Intel® Manager for Apache Hadoop software
handling deployment, configuration, monitoring, alerts, and security. Components include:
• HDFS 2.0.3 (Hadoop Distributed File System) and YARN (MRv2) as the distributed processing framework.
• ZooKeeper 3.4.5 (coordination) and Flume 1.3.0 (log collector).
• Sqoop 1.4.1 (data exchange), HBase 0.94.1 (columnar store), Hive 0.9.0 (SQL query), Pig 0.9.2 (scripting), Mahout 0.7 (machine learning), Oozie 3.3.0 (workflow), and R® connectors (statistics).

The distribution mixes Intel proprietary parts, Intel enhancements contributed back to open source, and open
source components included without change. All external names and brands are claimed as the property of others.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 42
In-Memory Computing Reference Practice in the
Industry

Google PowerDrill:

• Based on column-oriented storage, PowerDrill uses in-memory computing to deliver query performance of trillions of data cells per second, 10 to 100 times that of traditional column-oriented storage.
• PowerDrill can quickly skip unnecessary data blocks. Compared with full scanning, performance is improved by 100 times.
• Memory usage can be optimized and reduced to 1/16 using compression and encoding technologies.

[Architecture: client → query execution tree → root server → intermediate servers → leaf servers (with local storage) → storage layer (e.g., GFS).]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 43
Stream Computing Reference Practice in the
Industry

• IBM InfoSphere Streams is one of the core components of IBM's big data strategy. It supports high-speed processing of structured and unstructured data, processing of data in motion, throughput of millions of events per second, high expansibility, and the Streams Processing Language (SPL).

• HStreaming conducted a streaming reconstruction of the Hadoop MapReduce framework. The reconstructed framework is compatible with existing mainstream Hadoop infrastructures and processes data in streaming MapReduce mode with little or no change to the framework. Gartner rated HStreaming as the coolest ESP vendor. The reconstructed framework now supports text and video processing using the Apache Pig language (Pig Latin) and provides the high scalability of Hadoop, throughput of millions of events per second, and millisecond-level delay.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 44
Opportunities in the Big Data Era

Opportunity: The big data blue ocean strategy becomes a new
focus of enterprise competition.

• The huge commercial value brought by big data will lead a great transformation equal in force to the
computer revolution of the twentieth century. Big data is affecting every field, including commerce and
economics; it is promoting the generation of a new blue ocean, creating new points of economic growth,
and becoming a new focus of enterprise competition.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 45
Talents Required During the Development of Big Data

• Big data system R&D engineers.

• Big data application development engineers.

• Big data analysts.

• Data visualization engineers.

• Data security R&D engineers.

• Data science research talents.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 46
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 47
Huawei Big Data Platform Architecture

[Architecture, top-down: an application service layer accessed through Open API / SDK and REST / SNMP / Syslog; the DataFarm layer (Data → Information → Knowledge → Wisdom) with Porter, Miner, and Farmer; the Hadoop layer (Hadoop API, Plugin API) with Hive, MapReduce, Spark, Storm, and Flink running on YARN / ZooKeeper over HDFS / HBase, alongside LibrA; and Manager, providing system management, service governance, and security management.]

• The Hadoop layer provides a real-time data processing environment, enhanced on the basis of community open
source software.
• The DataFarm layer provides end-to-end data insight and builds the data supply chain from data to information,
knowledge, and wisdom, including Porter for data integration services, Miner for data mining services, and Farmer for
data service frameworks.
• Manager is a distributed management architecture. The administrator can control distributed clusters from a single
access point, covering system management, data security management, and data governance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 48
Core Capabilities of Huawei Big Data Team

Capability ladder, from basic to advanced:
• Be able to use the Apache open-source Hadoop community ecosystem.
• Be able to locate peripheral problems.
• Be able to resolve kernel-level problems (outstanding individuals).
• Be able to resolve kernel-level problems as a team.
• Be able to independently complete kernel-level development for critical service features.
• Be able to develop future-oriented kernel features.
• Be able to take the lead in the communities.
• Be able to establish top-level projects that are adaptable to the ecosystem in the communities.

Challenges addressed: a large number of components and code, frequent component updates, and the need for efficient feature integration.

• Outstanding product development and delivery capabilities and carrier-class operation support capabilities, empowered by the
Hadoop kernel team.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 49
Big Data Platform Partners from Finance and
Carrier Sectors

• Finance: Industrial and Commercial Bank of China (ICBC), China Merchants Bank (CMB), and Pacific Insurance Co., Ltd. (CPIC), covering 50% of the Top 10 customers in China's financial industry.
• Carrier: China Mobile and China Unicom, among the Top 3 telcos in China.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 50
Summary
These slides introduce:
• The big data era.
• Applications of big data in all walks of life.
• Opportunities and challenges brought by big data.
• Huawei big data solution.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 51
Quiz

1. Where is big data from? What are the features of big data?

2. Which social fields can big data be applied to?


3. What is Huawei big data solution called?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 52
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 53
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles of HDFS

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. HDFS application scenarios
B. HDFS system architecture
C. Key HDFS features
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 56
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 57
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 58
Dictionary vs. File System

Dictionary ↔ File System
• Character index ↔ File name + metadata
• Dictionary body ↔ Data blocks

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 59
HDFS Overview

The Hadoop distributed file system (HDFS) is developed based on the Google File System
(GFS) and runs on commodity hardware.
In addition to the features provided by other distributed file systems, HDFS also
provides the following features:
• High fault tolerance: resolves hardware unreliability problems.
• High throughput: supports applications involving large amounts of data.
• Large file storage: supports TB- and PB-level data storage.

HDFS is applicable to:
• Storing large files.
• Streaming data access.

HDFS is inapplicable to:
• Storing massive small files.
• Random writes.
• Low-latency reads.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 60
HDFS Application Scenarios

HDFS is the distributed file system of the Hadoop technical framework and is
used to manage files on multiple independent physical servers.

It is applicable to the following scenarios:

• Website user behavior data storage.


• Ecosystem data storage.
• Meteorological data storage.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 62
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 63
Position of HDFS in FusionInsight

[Architecture: the FusionInsight stack: an application service layer with Open API / SDK and REST / SNMP / Syslog; DataFarm (Data → Information → Knowledge → Wisdom) with Porter, Miner, and Farmer; Manager (system management, service governance, security management); and the Hadoop layer with Hive, M/R, Spark, Storm, and Flink running on YARN / ZooKeeper over HDFS / HBase, alongside LibrA.]

As the Hadoop storage infrastructure, HDFS serves as a distributed, fault-
tolerant file system with linear scalability.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 64
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 65
Basic System Architecture

[HDFS architecture: the NameNode holds metadata (name, replicas, ..., e.g., /home/foo/data, 3, ...). Clients issue metadata operations to the NameNode and read from or write to DataNodes directly; the NameNode issues block operations to DataNodes, and blocks are replicated across DataNodes in different racks (Rack 1, Rack 2).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 66
HDFS Data Write Process

1. create: The HDFS client calls create() on DistributedFileSystem.
2. create: DistributedFileSystem asks the NameNode to create the file in the namespace.
3. write: The client writes data to the FSDataOutputStream.
4. write packet: Data packets are written along a pipeline of DataNodes, each DataNode forwarding to the next.
5. ack packet: Acknowledgments flow back along the pipeline.
6. close: The client closes the FSDataOutputStream.
7. complete: DistributedFileSystem notifies the NameNode that the write is complete.
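The same sequence can be driven from the Java API. Below is a minimal sketch, assuming a reachable HDFS cluster and the Hadoop client libraries on the classpath; the path /tmp/hello.txt and its content are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // a DistributedFileSystem when fs.defaultFS points to HDFS
        // Steps 1-2: create() registers the new file with the NameNode and
        // returns a stream backed by the DataNode write pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
            out.writeUTF("Hello, HDFS");            // steps 3-5: packets flow down the pipeline, acks flow back
        } // steps 6-7: close() flushes remaining packets and signals completion to the NameNode
        fs.close();
    }
}
```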

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 67
HDFS Data Read Process

1. open: The HDFS client calls open() on DistributedFileSystem.
2. get block location: DistributedFileSystem obtains the file's block locations from the NameNode.
3. read: The client reads from the FSDataInputStream.
4-5. read: The stream reads each block from the nearest DataNode that holds it, advancing from block to block transparently.
6. close: The client closes the FSDataInputStream.
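The matching read side, again as a minimal sketch under the same assumptions (cluster reachable, client libraries present, illustrative path):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Steps 1-2: open() asks the NameNode for the file's block locations.
        try (FSDataInputStream in = fs.open(new Path("/tmp/hello.txt"))) {
            // Steps 3-5: reads are served by the closest DataNode holding each
            // block; the stream advances block to block transparently.
            System.out.println(in.readUTF());
        } // step 6: close() releases the DataNode connections
        fs.close();
    }
}
```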

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 68
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 69
Key Design of HDFS Architecture

Key design points of the HDFS architecture:
• NameNode / DataNode in master/slave mode
• Federation storage
• Unified file system namespace
• Data storage policies
• Data replication
• High availability (HA)
• Metadata persistence
• Multiple access modes
• Space reclamation
• Robustness

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 70
HDFS High Availability (HA)

[HA architecture: an active NameNode and a standby NameNode, each paired with a ZKFC that exchanges heartbeats with a ZooKeeper cluster for failover arbitration. The active NameNode writes the EditLog to a group of JournalNodes (JN); the standby reads the log from them and keeps its FSImage synchronized. The HDFS client sends metadata operations to the active NameNode; DataNodes heartbeat to both NameNodes, serve data reads and writes, and copy blocks among themselves.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 71
Metadata Persistence

Checkpointing between the active and standby NameNodes:

1. The standby NameNode notifies the active NameNode to roll its EditLog; the active node starts writing new edits to Editlog.new.
2. The standby obtains the EditLog and FSImage from the active node. (The FSImage is downloaded when the NameNode is initialized; the local FSImage file is used afterwards.)
3. The standby merges the EditLog and FSImage into a new FSImage.ckpt.
4. The standby uploads the new FSImage.ckpt to the active node.
5. The active node rolls the files: FSImage.ckpt becomes the new FSImage, and Editlog.new becomes the new EditLog.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 72
HDFS Federation

[Federation architecture: multiple NameNodes (NN1 ... NN-k ... NN-n), each managing its own namespace (NS1 ... NS-k ... NS-n) and an associated block pool (Pool 1 ... Pool k ... Pool n). Applications reach the namespaces through clients (Client-1 ... Client-k ... Client-n). All block pools share common block storage on the same set of DataNodes (DataNode1 ... DataNodeN).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 73
Data Replication

[Placement policy within a data center: the first replica of a block is placed on the node closest to the writing client (distance 0), another replica on a node in the same rack (distance 2), and a further replica on a node in a remote rack (distance 4). The diagram shows blocks B1-B4 placed across nodes Node1-Node5 in RACK1, RACK2, and RACK3.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 74
Configuring HDFS Data Storage
Policies

By default, the HDFS NameNode automatically selects DataNodes to store data
replicas. In practice, the following scenarios also exist:

• Layered storage: select a proper storage device, among multiple devices on a DataNode, for layered data storage.
• Tag storage: select a proper DataNode according to directory tags, which indicate data importance levels.
• Node group storage: store key data in highly reliable node groups, because the DataNode cluster uses heterogeneous servers.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 75
Configuring HDFS Data Storage Policies - Layered
Storage
Configuring DataNodes with layered storage:
• The HDFS layered storage architecture provides four types of storage devices: RAM_DISK (memory
virtualization hard disk), DISK (mechanical hard disk), ARCHIVE (high-density, low-cost storage
media), and SSD (solid state disk).
• Storage policies for different scenarios are formulated by combining the four types of storage devices:

Policy ID | Policy Name | Block Location (number of replicas) | Fallback storage (creation) | Fallback storage (replica)
15 | LAZY_PERSIST | RAM_DISK: 1, DISK: n-1 | DISK | DISK
12 | ALL_SSD | SSD: n | DISK | DISK
10 | ONE_SSD | SSD: 1, DISK: n-1 | SSD, DISK | SSD, DISK
7 | HOT (default) | DISK: n | <none> | ARCHIVE
5 | WARM | DISK: 1, ARCHIVE: n-1 | ARCHIVE, DISK | ARCHIVE, DISK
2 | COLD | ARCHIVE: n | <none> | <none>
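Policies can be applied per directory. Below is a sketch using the public FileSystem API (setStoragePolicy is available in Hadoop 2.6 and later, getStoragePolicy in 2.7 and later); the directory paths are illustrative, and the COLD policy only takes effect if DataNodes actually expose ARCHIVE-tagged volumes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoragePolicySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hotDir = new Path("/data/serving");    // illustrative paths
        Path coldDir = new Path("/data/archive");
        fs.setStoragePolicy(hotDir, "HOT");         // DISK: n (the default policy)
        fs.setStoragePolicy(coldDir, "COLD");       // ARCHIVE: n
        System.out.println(fs.getStoragePolicy(coldDir)); // verify the assignment
        fs.close();
    }
}
```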

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 76
Configuring HDFS Data Storage Policies - Tag Storage

[Tag storage: the NameNode maps directory tags to DataNode tags, e.g., /HBase tagged T1; /Hive tagged T1 and T3; /Spark tagged T2; /Flume tagged T3, and places replicas only on DataNodes carrying matching tags (DataNodes A-F carry combinations of T1, T2, and T3).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 77
Configuring HDFS Data Storage Policies - Node Group
Storage

[Node group storage: DataNodes are organized into rack groups (Rackgroup1 through Rackgroup4, each with two nodes; Rackgroup2 is mandatory). Files are placed by replica count, e.g., File 1 (1 replica), File 2 (3 replicas), File 3 (2 replicas), across the mandatory group and the others.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 78
Colocation

The definition of colocation: storing associated data, or data that is going to be associated, on the
same storage node.
As illustrated below, assume that file A and file D are going to be associated with each other. Without
colocation, their blocks are scattered across DataNodes (DN1-DN6), so joining them involves massive data
migration. The data transmission consumes much bandwidth, which greatly slows the processing of massive
data and degrades system performance.

[Diagram: the NameNode (NN) tracks files A, B, C, and D whose blocks are spread over DN1-DN6.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 79
Colocation Benefits

HDFS colocation stores files that need to be associated with each other on the same DataNode, so
that data does not have to be fetched from other nodes during associated computing. This greatly reduces
network bandwidth consumption.
When joining files A and D with the colocation feature, resource consumption is greatly reduced because the
blocks of the associated files are distributed on the same storage nodes.

[Diagram: with colocation, the blocks of files A and D are placed together on the same DataNodes.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 80
HDFS Data Integrity Assurance

HDFS ensures the integrity of the stored data and implements reliability
processing for the failure of each component:

Rebuilds data replicas on failed data disks.
• Each DataNode periodically reports block messages to the NameNode; if a replica (block) fails,
the NameNode starts a procedure to recover the lost replicas.

Ensures data balance among DataNodes.
• The HDFS architecture provides a data balancing mechanism, which ensures the even
distribution of data among all DataNodes.

Ensures metadata reliability.
• A log mechanism records metadata operations, and metadata is stored on both the active and standby NameNodes.
• The snapshot mechanism of the file system ensures that data can be recovered in a timely manner when a
misoperation occurs.

Provides the safe mode.
• HDFS provides a safe mode to prevent faults from spreading when a DataNode or hard disk is faulty.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 81
Other Key Design Points of the HDFS Architecture

Unified file system:


HDFS presents itself as one unified file system externally.

Space reclamation:

The recycle bin mechanism is provided and the number of replicas can be dynamically set.

Data organization:

Data is stored by block in the HDFS.

Access mode:
Data can be accessed through Java APIs, HTTP, or shell commands.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 82
Common Shell Commands

Type: dfs
• -cat: Show file contents
• -ls: Show a directory listing
• -rm: Delete files
• -put: Upload directories/files to HDFS
• -get: Download directories/files from HDFS
• -mkdir: Create a directory
• -chmod / -chown: Change the permissions / owner of files
• ...

Type: dfsadmin
• -safemode: Safe mode operations
• -report: Report service status

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 83
Summary
This module describes the following information about HDFS: basic concepts,
application scenarios, technical architecture and its key features.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 84
Quiz

• What is HDFS and what can it be used for?


• What are the design objectives of HDFS?
• Describe the HDFS read and write processes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 85
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 86
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles of MapReduce and YARN

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. Concepts of MapReduce and YARN
B. Application scenarios and principles of MapReduce
C. Functions and architectures of MapReduce and YARN
D. New features of YARN
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 89
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 90
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 91
MapReduce Overview

MapReduce is developed based on the MapReduce paper published by Google and is used for parallel
computing on massive data sets (larger than 1 TB). It delivers the following highlights:

• Easy to program: programmers only need to describe what to do, and the execution framework does the job accordingly.
• Outstanding scalability: cluster capability can be improved by adding nodes.
• High fault tolerance: cluster availability and fault tolerance are improved by policies such as computing or data migration.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 92
YARN Overview

Apache Hadoop YARN (Yet Another Resource Negotiator) is the new Hadoop resource manager. It
provides unified resource management and scheduling for upper-layer applications, remarkably
improving cluster resource utilization, unified resource management, and data sharing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 93
Position of YARN in FusionInsight

[Architecture: the FusionInsight stack: an application service layer with Open API / SDK and REST / SNMP / Syslog; DataFarm (Data → Information → Knowledge → Wisdom) with Porter, Miner, and Farmer; Manager (system management, service governance, security management); and the Hadoop layer with Hive, M/R, Spark, Streaming, and Flink running on YARN / ZooKeeper over HDFS / HBase, alongside LibrA.]

YARN is the resource management system of Hadoop 2.0. It is a general
resource management module that manages and schedules resources for
applications.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 94
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 95
Working Process of MapReduce (1)

• Before starting MapReduce, make sure that the files to be processed are stored in HDFS.
• Commit: the client submits a request to ResourceManager (uploading Job.jar, Job.split, and Job.xml), and ResourceManager creates a job. One application maps to one job (example job ID: job_201431281420_0001).
• Split: before jobs are submitted, the files to be processed are split. By default, the MapReduce framework regards one block as one split. Client applications can redefine the mapping between blocks and splits.
• After the job is submitted, ResourceManager selects an appropriate NodeManager in the cluster, based on NodeManager workloads, to start the ApplicationMaster. The ApplicationMaster initializes the job and applies to ResourceManager for resources. ResourceManager then selects appropriate NodeManagers to start containers for task execution.
• Map: the outputs of Map tasks are placed in a buffer in memory. When the buffer overflows, its data is written to local disks; before that, the following steps must be completed:
  1. Partition: by default, the hash algorithm is used for partitioning. The MapReduce framework determines the number of partitions from the number of Reduce tasks. Records with the same key are sent to the same Reduce task for processing.
  2. Sort: the outputs of Map are sorted; for example, ('Hi','1'),('Hello','1') are reordered as ('Hello','1'),('Hi','1').

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 96
Working Process of MapReduce (2)

3. Combine: optional by default. For example, ('Hi','1'),('Hi','1'),('Hello','1'),('Hello','1') are combined into ('Hi','2'),('Hello','2').
4. Spill/Merge: after a Map task finishes, many spill files may have been generated. These spill files are merged into a single MOF (MapOutFile) that is partitioned and sorted. To reduce the amount of data written to disks, MapReduce allows MOFs to be compressed before being written.

• Copy: when the MOF output progress of Map tasks reaches 3%, Reduce tasks start and fetch MOF files from each Map task. The number of Reduce tasks is determined by the client, and the number of MOF partitions is determined by the number of Reduce tasks; for this reason, the MOF files output by Map tasks map to Reduce tasks.
• Sort/Merge: MOF files need to be sorted. If the amount of data received by a Reduce task is small, it is kept directly in the buffer; as the number of files in the buffer increases, a MapReduce background thread merges them into a large, sorted file (in memory or on disk). Many intermediate files are generated during merging; the last merge result is fed directly to the user-defined Reduce function.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 98
Shuffle Mechanism

Combine → Spill/Merge → Copy → Sort/Merge → Reduce

Shuffle is the data transfer process between the Map phase and the Reduce phase. It involves
Reduce tasks obtaining MOF files from Map tasks, then sorting and merging the MOF files.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 99
Example: Typical Program WordCount

[Diagram: (1) a WordCount application is submitted to ResourceManager; (2) an ApplicationMaster is started on a NameNode/NodeManager; (3) Map tasks in containers on slave nodes (NodeManager + DataNode) emit key-value pairs such as <a,1>, <are,1>, <hi,1>, <hello,1>; (4) Reduce tasks aggregate them into totals such as <a,3>, <are,2>, <hi,2>, <hello,3>.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 100
Functions of WordCount

Input: a file that contains words. Output: the number of times each word occurs.

Input:
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop

Output (via MapReduce):
Bye 3
Hadoop 4
Hello 3
World 2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 101
Map Process of WordCount

Input → Map → Output:

01 "Hello World Bye World" → <Hello,1> <World,1> <Bye,1> <World,1>
02 "Hello Hadoop Bye Hadoop" → <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
03 "Bye Hadoop Hello Hadoop" → <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 102
Reduce Process of WordCount

Map output → Combine → Shuffle → Reduce input → Reduce output:

• Combine merges each Map task's output locally: <Hello,1> <World,1> <Bye,1> <World,1> becomes <Hello,1> <World,2> <Bye,1>; <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> becomes <Hello,1> <Hadoop,2> <Bye,1>; <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> becomes <Bye,1> <Hadoop,2> <Hello,1>.
• Shuffle groups the values by key, and each Reduce task aggregates its keys:
  <Bye,1 1 1> → Reduce → Bye 3
  <Hadoop,2 2> → Reduce → Hadoop 4
  <Hello,1 1 1> → Reduce → Hello 3
  <World,2> → Reduce → World 2
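The program behind these slides is essentially the canonical Hadoop MapReduce example. Below is a minimal sketch against the org.apache.hadoop.mapreduce API (class name and I/O paths are illustrative); note how setCombinerClass reuses the reducer to implement the optional Combine step described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                ctx.write(word, ONE);            // emits <word, 1> for each token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // sums <word, [1,1,...]>
            result.set(sum);
            ctx.write(key, result);              // emits <word, total>
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the optional Combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, this would be launched with something like: yarn jar wordcount.jar WordCount /input /output.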

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 103
Architecture of YARN

[Diagram: clients submit jobs to ResourceManager; each application gets an ApplicationMaster (App Mstr) running in a container on a NodeManager; the ApplicationMaster sends resource requests to ResourceManager and launches containers on NodeManagers. Arrows denote job submission, MapReduce status, node status, and resource requests.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 104
Task Scheduling of MapReduce on YARN

[Diagram, steps 1-8: (1) the client submits an application to ResourceManager (Applications Manager + Scheduler); (2) ResourceManager allocates a container on a NodeManager and (3) launches the MR ApplicationMaster in it; (4) the ApplicationMaster registers with ResourceManager; (5) it requests resources; (6) NodeManagers launch containers for Map and Reduce tasks; (7) the tasks report status to the ApplicationMaster; (8) on completion, the ApplicationMaster unregisters from ResourceManager.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 105
YARN HA Solution

ResourceManager in YARN manages resources and schedules tasks for the cluster. The YARN HA solution
uses redundant ResourceManager nodes to solve the single point of failure of ResourceManager:

1. The active ResourceManager writes its state into a ZooKeeper cluster.
2. If the active ResourceManager fails, failover to the standby ResourceManager occurs automatically.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 106
YARN ApplicationMaster Fault Tolerance Mechanism

[Diagram: when an ApplicationMaster (AM-1) fails or restarts, YARN launches a new AM-1 instance, which recovers the application and re-attaches to its containers.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 107
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 108
Resource Management

• Yarn manages and allocates memory and CPU resources.
• The memory and CPU resources offered by each NodeManager can be
configured (on the Yarn service configuration page), for example:
  - yarn.nodemanager.resource.memory-mb
  - yarn.nodemanager.vmem-pmem-ratio
  - yarn.nodemanager.resource.cpu-vcores
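In stock Hadoop these keys live in yarn-site.xml and are read by each NodeManager at startup; the sketch below only illustrates the three parameters programmatically, and the values are assumptions rather than recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class NodeManagerResourceSketch {
    public static void main(String[] args) {
        // Illustrative only: in a real cluster these keys are set in
        // yarn-site.xml (or on the Yarn service configuration page).
        Configuration conf = new Configuration(false);
        conf.setInt("yarn.nodemanager.resource.memory-mb", 65536); // memory offered to containers
        conf.setFloat("yarn.nodemanager.vmem-pmem-ratio", 2.1f);   // virtual-to-physical memory cap
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 16);   // vcores offered to containers
        System.out.println(conf.get("yarn.nodemanager.resource.memory-mb") + " MB per NodeManager");
    }
}
```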

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 109
Resource Allocation Model

The scheduler works over a queue hierarchy (Root → parent queues → leaf queues):
1. Selects a queue.
2. Selects an application (App1 ... App N) from the queue.
3. Matches the application's resource requests, preferring, in order: a specific server (Server A, Server B), a specific rack (Rack A, Rack B), and finally any resources.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 110
Capacity Scheduler Overview

Capacity Scheduler enables Hadoop applications to run in a shared, multi-tenant cluster while maximizing the throughput and utilization of the cluster.

Capacity Scheduler allocates resources by queue. Users can set upper and lower limits for the resource usage of each queue. Administrators can restrict the resources used by a queue, user, or job. Job priorities can be set, but resource preemption is not supported.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 111
Highlights of Capacity Scheduler

• Capacity assurance: Administrators can set upper and lower limits for the resource usage of each
queue. All applications submitted to the queue share the resources.

• Flexibility: The remaining resources of a queue can be used by other queues that require resources. If a new
application is submitted to the queue, other queues release and return the resources to the queue.

• Priority: Priority queuing is supported (FIFO by default).


• Multi-leasing: Multiple users can share a cluster, and multiple applications can run concurrently.
Administrators can add multiple restrictions to prevent cluster resources from being exclusively occupied by an
application, user, or queue.

• Dynamic update of configuration files: Administrators can dynamically modify configuration parameters to manage clusters online.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 112
Task Selection by Capacity Scheduler

During scheduling, select an appropriate queue first based on the following


policies:
• The queue with the lower resource usage is allocated first. For example, queues Q1 and Q2 both have a capacity of 30; if the used capacity of Q1 is 10 and that of Q2 is 12, resources are allocated to Q1 first.
• Resources are allocated to the queue at the shallower level of the queue hierarchy first. For example, between QueueA and QueueB.childQueueB, resources are allocated to QueueA first.
• Resources are allocated to the resource reclamation request queue first.

A task is then selected from the queue based on the following policy:
• The task is selected based on the task priority and submission sequence as well as the limits of user resources and memory.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 113
Queue Resource Limitation (1)

Queues are created on the Tenant page. After a tenant is created and associated with YARN, a queue with the same name as the tenant is created. For example, if tenants QueueA and QueueB are created, two YARN queues QueueA and QueueB are created.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 114
Queue Resource Limitation (2)
Queue resource capacity (percentage): there are three queues, default, QueueA, and QueueB, and each has a [queue name].capacity configuration:
• The capacity of the default queue is 20% of the total cluster resources.
• The capacity of the QueueA queue is 10% of the total cluster resources.
• The capacity of the QueueB queue is 10% of the total cluster resources.
• The root-default shadow queue in the background holds the remaining 60% of the total cluster resources.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 115
Queue Resource Limitation (3)

01 Sharing idle resources: due to resource sharing, the resources used by a queue may exceed its capacity (for example, QueueA.capacity). The maximum resource usage can be limited by the maximum-capacity parameter.

02 If only a few tasks are running in a queue, the remaining resources of the queue can be shared with other queues. For example, if maximum-capacity of QueueA is set to 100 and tasks are running in QueueA only, QueueA can theoretically use all the cluster resources.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 116
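A worked illustration with assumed values: if QueueA.capacity = 10 and QueueA.maximum-capacity = 30, QueueA is guaranteed 10% of the cluster, may borrow idle resources up to 30%, and returns the borrowed share when other queues need their guaranteed capacity again.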
User and Task Limitation

Log in to FusionInsight Manager and choose Tenant > Dynamic Resource Plan > Queue Config to configure user and task limitation parameters.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 117
User Limitation (1)
Minimum resource assurance (percentage) of a user:
• The resources for each user in a queue are limited at any time. If tasks of multiple users are running at the same time in a queue, the resource usage of each user fluctuates
between the minimum value and the maximum value. The maximum value is determined by the number of running tasks, while the minimum value is determined by minimum-
user-limit-percent.

For example, if yarn.scheduler.capacity.root.QueueA.minimum-user-limit-percent=25, the queue


resources are adjusted as follows when the number of users who submit tasks increases:

• The first user submits tasks to QueueA: the user obtains 100% of QueueA resources.
• The second user submits tasks to QueueA: each user obtains 50% of QueueA resources at most.
• The third user submits tasks to QueueA: each user obtains 33.33% of QueueA resources at most.
• The fourth user submits tasks to QueueA: each user obtains 25% of QueueA resources at most.
• The fifth user submits tasks to QueueA: to ensure that each user obtains at least 25% of the resources, the fifth user cannot obtain any resources and must wait for them to be released by other users.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 118
User Limitation (2)

Maximum resource usage of a user (multiples of queue capacity):

Indicates the multiples of queue capacity. This parameter sets the resources that can be obtained by one user, with a default of 1: yarn.scheduler.capacity.root.QueueD.user-limit-factor=1 indicates that the resource capacity obtained by a user cannot exceed the queue capacity. No matter how many free resources a cluster has, the resource capacity that can be obtained by a user cannot exceed maximum-capacity.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 119
Task Limitation

01 Maximum number of active tasks: the maximum number of active tasks in a cluster, including running and suspended tasks. When the number of submitted task requests reaches the limit, new tasks are rejected. The default value is 10000.

02 Maximum number of tasks in a queue: the maximum number of tasks submitted to a queue. If the parameter value is set to 1000 for QueueA, QueueA allows a maximum of 1000 active tasks.

03 Maximum number of tasks submitted by a user: depends on the maximum number of tasks in a queue. If QueueA allows a maximum of 1000 tasks, the maximum number of tasks that each user can submit is 1000 x yarn.scheduler.capacity.root.QueueA.minimum-user-limit-percent (assume 25%) x yarn.scheduler.capacity.root.QueueA.user-limit-factor (assume 1).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 120
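Worked example with the assumed values above: each user of QueueA may submit at most 1000 x 25% x 1 = 250 tasks.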
Queue Information
Choose Services > YARN > ResourceManager (active) > Scheduler to view queue information.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 121
CONTENTS
01 02 03 04
Introduction to Functions and Resource Enhanced Features
MapReduce and Architectures of Management and
YARN MapReduce and Task Scheduling of
YARN YARN

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 122
Enhanced Features - YARN Dynamic Memory Management

(Figure: flowchart. NodeManager calculates the memory usage of each Container. If a Container does not exceed its own memory threshold, the Containers can run. If it does, NodeManager checks whether the total memory usage exceeds the threshold set for NodeManager: if not, the Containers can still run; if yes, Containers with excessive memory usage cannot run.)

NM MEM Threshold = yarn.nodemanager.resource.memory-mb x 1024 x 1024 x yarn.nodemanager.dynamic.memory.usage.threshold

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 123
Enhanced Features - YARN Label-based Scheduling

(Figure: tasks are submitted to queues, and queues map to labeled NodeManagers. Applications with common resource requirements run on servers with standard performance; applications with demanding memory requirements run on servers with large memory; applications with demanding I/O requirements run on servers with high I/O.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 124
Summary
This module describes the following information: application scenarios and architectures of MapReduce and YARN, resource management and task scheduling of YARN, and enhanced features of YARN in FusionInsight HD.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 125
Quiz

• What is the working principle of MapReduce?


• What is the working principle of Yarn?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 126
Quiz

• What are highlights of MapReduce?


A. Easy to program.
B. Outstanding scalability.
C. Real-time computing.
D. High fault tolerance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 127
Quiz

• What is the abstraction of Yarn resources?


A. Memory.
B. CPU.
C. Container.
D. Disk space.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 128
Quiz

• What does MapReduce apply to?


A. Iterative computing.
B. Offline computing.
C. Real-time interactive computing.
D. Stream computing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 129
Quiz

• What are highlights of capacity scheduler?


A. Capacity assurance.
B. Flexibility.
C. Multi-leasing.
D. Dynamic update of configuration files.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 130
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 131
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles of
FusionInsight Spark2x

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives

Upon completion of this course, you will be able to:
A. Understand application scenarios and master highlights of Spark
B. Master the computing capability and technical framework of Spark
C. Master the integration of Spark components in FusionInsight HD
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 134
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 135
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 136
Spark Introduction

• Apache Spark was developed in the UC Berkeley AMP lab in 2009.

• Apache Spark is a fast, versatile, and scalable in-memory big data computing engine.

• Apache Spark is a one-stop solution that integrates batch processing, real-time stream processing,
interactive query, graph computing, and machine learning.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 137
Application Scenarios

• Batch processing can be used for extracting,


transforming, and loading (ETL).

• Machine learning can be used to automatically


determine whether comments of Taobao
customers are positive or negative.
• Interactive analysis can be used to query the
Hive data warehouse.
• Stream processing can be used for real-time
businesses such as page-click stream analysis,
recommendation systems, and public opinion analysis.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 138
Spark Highlights

Light Fast
• Spark core code has • Delay for small
30,000 lines. 01 02 datasets reaches the
sub-second level.

Spark
Flexible Smart
• Spark offers different 04 03 • Spark smartly uses
levels of flexibility. existing big data
components.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 139
Spark Ecosystem

(Figure: the Spark ecosystem. Spark connects to applications, environments, and data sources, including Hive, HBase, Hadoop, Mahout, Docker, Mesos, Elasticsearch, Flume, Kafka, and MySQL.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 140
Spark vs MapReduce (1)

(Figure: data sharing in MapReduce vs Spark. In MapReduce, each iteration and each query reads from and writes to HDFS, which suits one-time processing only; in Spark, iterations and queries share data through distributed memory, avoiding repeated HDFS reads and writes.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 141
Spark vs MapReduce (2)

                      Hadoop        Spark         Spark
Data volume           102.5 TB      102 TB        1000 TB
Time required (min)   72            23            234
Number of nodes       2100          206           206
Number of cores       50,400        6592          6592
Rate                  1.4 TB/min    4.27 TB/min   4.27 TB/min
Rate/node             0.67 GB/min   20.7 GB/min   22.5 GB/min
Daytona Gray Sort     Yes           Yes           Yes

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 142
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 143
Spark System Architecture

Structured Spark
Spark SQL MLlib GraphX SparkR
Streaming Streaming

Spark Core

Standalone YARN Mesos

Existing functions of Spark 1.0

New functions of Spark 2.0

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 144
Core Concepts of Spark - RDD
• Resilient Distributed Datasets (RDDs) are elastic, read-only, and partitioned distributed datasets.
• RDDs are stored in memory by default and are written to disks when the memory is insufficient.
• RDD data is stored in the cluster as partitions.
• RDD has a lineage mechanism (Spark Lineage), which allows for rapid data recovery when data loss occurs.

(Figure: RDD data flow. Data in HDFS, such as "Hello Spark", "Hello Hadoop", and "China Mobile", is loaded into the Spark cluster as partitioned RDD1, transformed into RDD2, and can be written to external storage.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 146
RDD Dependencies

Narrow dependencies (each parent partition is used by at most one child partition): map, filter; union; join with inputs co-partitioned.

Wide dependencies (a parent partition feeds multiple child partitions): groupByKey; join with inputs not co-partitioned.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 147
Stage Division of RDD

(Figure: stage division of an RDD DAG. RDD A feeds B through groupBy, forming Stage1; C is mapped to D and unioned with E into F, forming Stage2; B and F are joined into G, forming Stage3. Stages are split at wide dependencies.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 148
RDD Operators

Transformation

• Transformation operators are invoked to generate a new RDD from one or more existing RDDs.
Such an operator initiates a job only when an Action operator is invoked.
• Typical operators: map, flatMap, filter, and reduceByKey.

Action

• A job is immediately started when action operators are invoked.


• Typical operators: take, count, and saveAsTextFile.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 149
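The laziness of Transformation operators can be seen in a minimal sketch using the Spark Java API (the input path is illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class LazyDemo {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("LazyDemo"));
      // Transformations only record the lineage; no job runs yet.
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input");
      JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator()); // transformation
      JavaRDD<String> longWords = words.filter(w -> w.length() > 3);                      // transformation
      // The action below triggers the actual computation.
      System.out.println(longWords.count());                                              // action
      sc.stop();
    }
  }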
Major Roles of Spark (1)

Driver

Responsible for the application business logic and operation planning (DAG).

ApplicationMaster

Manages application resources and applies for resources from ResourceManager as needed.

Client

Demand side that submits requirements (applications).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 150
Major Roles of Spark (2)

ResourceManager

The resource management department that centrally schedules and distributes resources in the entire cluster.

NodeManager

Manages the resources of the current node.

Executor

The actual executor of a task. An application is split across multiple executors for computing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 151
Spark on YARN - Client Operation Process

(Figure: Spark on YARN-Client. The Driver with its YARNClientSchedulerBackend runs on the client: (1) the client submits an application to the ResourceManager; (2) the ApplicationMaster (ExecutorLauncher) is submitted; (3) the ApplicationMaster applies for containers; (4) NodeManagers start the containers and Executors; (5) the Driver schedules tasks on the Executors.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 152
Spark on YARN - Cluster Operation Process

(Figure: Spark on YARN-Cluster. (1) The client submits an application to the ResourceManager; (2) the ResourceManager allocates resources for the application and starts the ApplicationMaster, which includes the Driver and DAGScheduler, in a container on a NodeManager; (3) the ApplicationMaster applies to the ResourceManager for Executors; (4) the Driver assigns tasks to the Executors; (5) Executors report task statuses back.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 153
Differences Between YARN-Client and YARN-Cluster

Differences

• The differences between YARN-Client and YARN-Cluster lie in the ApplicationMaster (in YARN-Cluster mode, the ApplicationMaster also hosts the Driver).
• YARN-Client is suitable for testing, whereas YARN-Cluster is suitable for production.
• If the task submission node in YARN-Client mode goes down, the entire task fails; the same failure in YARN-Cluster mode does not affect the task.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 154
Typical Case - WordCount

textFile -> flatMap -> map -> reduceByKey -> saveAsTextFile
HDFS -> RDD -> RDD -> RDD -> RDD -> HDFS

Input lines (HDFS): "An apple", "A pair of shoes", "Orange apple"
After flatMap: An, apple, A, pair, of, shoes, Orange, apple
After map: (An,1) (apple,1) (A,1) (pair,1) (of,1) (shoes,1) (Orange,1) (apple,1)
After reduceByKey: (An,1) (A,1) (apple,2) (pair,1) (of,1) (shoes,1) (Orange,1)
The result is saved back to HDFS.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 155
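The chain on this slide maps one-to-one onto the Spark Java API. A minimal sketch (paths are illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import scala.Tuple2;

  public class SparkWordCount {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/words");                           // textFile
      JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator()); // flatMap
      JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));      // map
      JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);           // reduceByKey
      counts.saveAsTextFile("hdfs:///tmp/wordcount-out");                                 // saveAsTextFile (action)
      sc.stop();
    }
  }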
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 156
Spark SQL Overview

• Spark SQL is the module used in Spark for structured data processing. In Spark
applications, you can seamlessly use SQL statements or DataFrame APIs to query
structured data.

(Figure: the Catalyst pipeline. A SQL AST or a DataFrame/Dataset becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Optimization produces an Optimized Logical Plan; candidate Physical Plans are generated and a Cost Model selects one Physical Plan; Code Generation turns it into RDDs.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 157
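As a sketch of the two equivalent entry points, SQL statements and the DataFrame API go through the same Catalyst pipeline (the file path, view, and column names are illustrative):

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class SparkSqlDemo {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder().appName("SparkSqlDemo").getOrCreate();
      Dataset<Row> people = spark.read().json("hdfs:///tmp/people.json"); // DataFrame API
      people.createOrReplaceTempView("people");
      Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18"); // SQL API
      adults.show();
      spark.stop();
    }
  }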
Introduction to Dataset

• A dataset is a strongly typed collection of objects in a


particular domain that can be converted in parallel by a
function or relationship operation.
• A dataset is represented by a Catalyst logical execution
plan, and the data is stored in encoded binary form, and
the sort, filter, and shuffle operations can be performed
without deserialization.
• A dataset is lazy and triggers computing only when an
action operation is performed. When an action
operation is performed, Spark uses the query optimizer
to optimize the logical plan and generate an efficient
parallel distributed physical plan.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 158
Introduction to DataFrame

DataFrame is a dataset with specified column names. DataFrame is a special case of Dataset[Row].

(Figure: an RDD[Person] stores whole Person objects, while a DataFrame organizes the same records into named, typed columns: name: String, age: Int, height: Double.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 159
RDD, DataFrame, and Datasets (1)

RDD:
• Advantages: type-safe and object-oriented.
• Disadvantages: high performance overhead for serialization and deserialization; high GC overhead due to frequent object creation and deletion.

DataFrame:
• Advantages: schema information reduces serialization and deserialization overhead.
• Disadvantages: not object-oriented; no type checking at compile time.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 160
RDD, DataFrame, and Datasets (2)

Characteristics of Dataset:
• Fast: in most scenarios, performance is superior to RDD. Encoders outperform Kryo or Java serialization and avoid unnecessary format conversion.
• Type-safe: similar to RDD, functions are type-checked at compile time as far as possible.
• Dataset, DataFrame, and RDD can be converted to each other.

Dataset has the advantages of RDD and DataFrame, and avoids their disadvantages.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 161
Spark SQL and Hive

• The execution engine of Spark SQL is Spark Core, and the default
execution engine of Hive is MapReduce.
Differences • The execution speed of Spark SQL is 10 to 100 times faster than Hive.
• Spark SQL does not support buckets, but Hive does.

• Spark SQL depends on the metadata of Hive.


• Spark SQL is compatible with most syntax and functions of Hive. Dependencies
• Spark SQL can use the custom functions of Hive.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 162
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 163
Structured Streaming Overview (1)

Structured Streaming is a streaming data-processing engine built on the Spark SQL engine. You can express a streaming computation in the same way as a batch computation on static data. As streaming data keeps arriving, Spark SQL processes it incrementally and continuously, and updates the results into the result set.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 164
Structured Streaming Overview (2)

(Figure: a data stream is modeled as an unbounded table. New data in the data stream corresponds to new rows appended to the unbounded table.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 165
Programming Model for Structured Streaming

(Figure: with a trigger every 1 second, the input table grows over time (data up to t=1, t=2, t=3); the query runs against the input at every trigger, and the result up to each t is written to the output in complete mode.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 166
Example Programming Model of Structured
Streaming
(Figure: words typed into nc, "Cat dog" and "Dog dog" by t=1, "Owl cat" by t=2, and "Dog owl" by t=3, are appended to the unbounded table. The running word count evolves accordingly: t=1 gives Cat 1, Dog 3; t=2 gives Cat 2, Dog 3, Owl 1; t=3 gives Cat 2, Dog 4, Owl 2.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 167
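The running word count above is commonly expressed with the Structured Streaming API; a minimal sketch reading from a socket (host and port are illustrative):

  import java.util.Arrays;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Encoders;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.streaming.StreamingQuery;

  public class StructuredWordCount {
    public static void main(String[] args) throws Exception {
      SparkSession spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate();
      // Every line arriving on the socket becomes a new row of the unbounded input table.
      Dataset<Row> lines = spark.readStream().format("socket")
          .option("host", "localhost").option("port", 9999).load();
      Dataset<String> words = lines.as(Encoders.STRING())
          .flatMap((FlatMapFunction<String, String>) l -> Arrays.asList(l.split(" ")).iterator(),
                   Encoders.STRING());
      Dataset<Row> counts = words.groupBy("value").count();
      // complete mode: the whole updated result table is emitted on every trigger.
      StreamingQuery query = counts.writeStream().outputMode("complete").format("console").start();
      query.awaitTermination();
    }
  }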
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 168
Overview of Spark Streaming

Spark Streaming is an extension of the Spark core API. It is a real-time computing framework featuring scalability, high throughput, and fault tolerance.

(Figure: Spark Streaming reads from sources such as HDFS and Kafka and writes results to HDFS, Kafka, or databases.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 169
Mini Batch Processing of Spark Streaming

Spark Streaming programming is based on DStream, which decomposes streaming programming into a series of short batch jobs.

(Figure: the input data stream is divided into batches of input data; the Spark Streaming engine processes each batch and produces batches of processed data.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 170
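A minimal DStream sketch of this mini-batch model (host, port, and the 1-second batch interval are illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaPairDStream;
  import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import scala.Tuple2;

  public class DStreamWordCount {
    public static void main(String[] args) throws Exception {
      // Each 1-second interval of the stream becomes one short batch job.
      JavaStreamingContext jssc =
          new JavaStreamingContext(new SparkConf().setAppName("DStreamWordCount"), Durations.seconds(1));
      JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
      JavaPairDStream<String, Integer> counts = lines
          .flatMap(l -> Arrays.asList(l.split(" ")).iterator())
          .mapToPair(w -> new Tuple2<>(w, 1))
          .reduceByKey((a, b) -> a + b);
      counts.print();
      jssc.start();
      jssc.awaitTermination();
    }
  }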
Fault Tolerance Mechanism of Spark Streaming

Spark Streaming performs computing based on RDDs. If some partitions of an RDD are lost, they can be recovered using the RDD lineage mechanism.

(Figure: the RDDs of interval [0,1) and interval [1,2) are derived through map and reduce operations, so lost partitions can be recomputed from their lineage.)


Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 171
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 172
Spark WebUI

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 173
Spark and Other Components

In the FusionInsight cluster, Spark interacts with the following components:

• HDFS: Spark reads or writes data in the HDFS (mandatory).


• YARN: YARN schedules and manages resources to support the running of
Spark tasks (mandatory).
• Hive: Spark SQL shares metadata and data files with Hive (mandatory).
• ZooKeeper: HA of JDBCServer depends on the coordination of
ZooKeeper (mandatory).
• Kafka: Spark can receive data streams sent by Kafka (optional).
• HBase: Spark can perform operations on HBase tables (optional).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 174
Summary
• The background, application scenarios, and characteristics of Spark are briefly introduced.
• Basic concepts, technical architecture, task running processes, Spark on YARN, and application scheduling
of Spark are introduced.
• Integration of Spark in FusionInsight HD is introduced.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 175
Quiz

• What are the characteristics of Spark?


• What are the advantages of Spark in comparison with MapReduce?

• What are the differences between wide dependencies and narrow dependencies of
Spark?
• What are the application scenarios of Spark?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 176
Quiz

• RDD operators are classified into: _________ and _________.

• The ___________ module is the core module of Spark.

• RDD dependency types include ___________ and ___________.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 177
More Information

• Download training materials:


– http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094&courseId=Node1000011807

• eLearning course:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 178
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives

Upon completion of this course, you will be able to know:
A. System architecture of HBase
B. Key features of HBase
C. Basic functions of HBase
D. Huawei enhanced features of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 181
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 182
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 183
HBase Overview

HBase is a column-based distributed storage system that


features high reliability, performance, and scalability.

• HBase is suitable for storing big table data (which contains billions of rows and millions of columns) and allows real-time
data access.
• HBase uses HDFS as the file storage system to provide a distributed column-oriented database system that allows real-
time data reading and writing.
• HBase uses ZooKeeper as the collaboration service.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 184
HBase vs. RDB

• Distributed storage and column-oriented.


• Dynamic extension of columns.
HBase • Supports common commercial hardware, lowering
the expansion cost.

• Fixed data structure.


• Pre-defined data structure. RDB
• I/O intensive and cost-consuming expansion.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 185
Data Stored By Row

ID Name Phone Address

Data is stored by row in an underlying file system. Generally, a fixed amount of


space is allocated to each row.
• Advantages: Data can be added, modified, or read by row.
• Disadvantages: Some unnecessary data is obtained when data in a column is queried.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 186
Data Stored by Column

ID Name Phone Address

Data is stored by column in an underlying file system.


• Advantages: Data can be read or calculated by column.
• Disadvantages: When a row is read, multiple I/O operations may be required.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 187
Application Scenarios of HBase

HBase applies to the following scenarios:

• Massive data (TB and PB).


• The Atomicity, Consistency, Isolation, Durability (ACID) feature
supported by traditional relational databases is not required.
• High throughput.
• Efficient random reading of massive data.
• High scalability.
• Simultaneous processing of structured and unstructured data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 189
Position of HBase in FusionInsight
(Figure: position of HBase in FusionInsight. In the Hadoop layer, HDFS/HBase sit at the bottom, with YARN/ZooKeeper above them and Hive, M/R, Spark, Storm, and Flink on top; the DataFarm layer (Porter, Miner, Farmer) and the application service layer sit above, and FusionInsight Manager provides system management, service governance, and security management.)

HBase is a column-based distributed storage system that features high reliability,


performance, and scalability. It stores massive data and is designed to eliminate
limitations of relational databases in the processing of mass data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 190
KeyValue Storage Model (1)

ID Name Phone Address

Key-01 Value-ID01 Key-01 Value-Name01

Key-01 Value-Phone01 Key-01 Value-Address01

• KeyValue has a specific structure. Key is used to quickly query a data record, and Value is used to store user data.

• As a basic user data storage unit, KeyValue must store some description of itself, such as timestamp and type information. This
requires some structured space.

• Data can be expanded dynamically, adapting to changes of data types and structures. Data is read and written by block. Different columns are not associated with each other, and neither are different tables.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 191
KeyValue Storage Model (2)
Partition mode of a KeyValue database: partitioning is based on continuous key ranges.

Region_01 Region_02 Region_05 Region_06 Region_09 Region_10

Region_03 Region_04 Region_07 Region_08 Region_11 Region_12

Node1 Node2 Node3

Region_01 Region_05 Region_02 Region_06 Region_03 Region_07

Region_09 Region_04 Region_10 Region_12 Region_11 Region_08

• Data subregions are created based on the RowKey range (sorting based on a sorting algorithm such as the alphabetic order
based on RowKeys). Each subregion is a basic distributed storage unit.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 192
KeyValue Storage Model (3)

• The underlying data of HBase exists in the form of KeyValue. KeyValue has a specific format.
• KeyValue contains key information such as timestamp and type, etc.
• The same key can be associated with multiple Values. Each KeyValue has a qualifier.
• There can be multiple KeyValues associated with the same Key and Qualifier. In this case, they are distinguished
using timestamps. This is why there are multiple versions of the same data record.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 193
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 194
HBase Architecture (1)

(Figure: HBase architecture. The Client talks to ZooKeeper and the HMaster; HRegionServers host HRegions, each consisting of Stores (a MemStore plus StoreFiles stored as HFiles), with the Regions of a RegionServer sharing an HLog; data is persisted through DFS Clients to HDFS DataNodes.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 195
HBase Architecture (2)

• Store: a Region consists of one or multiple Stores. Each Store corresponds to a Column Family.
• MemStore: a Store contains one MemStore. Data inserted into a Region by the client is cached in the MemStore.
• StoreFile: the data flushed to HDFS is stored as a StoreFile in HDFS.
• HFile: HFile defines the storage format of StoreFiles in the file system. HFile is the underlying implementation of StoreFile.
• HLog: HLogs prevent data loss when a RegionServer is faulty. Multiple Regions in a RegionServer share the same HLog.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 196
HMaster (1)
"Hey, Region A, please move to RegionServer 1!"
"RegionServer 2 was gone! Let others take over
it's Regions!"

RegionServer1 RegionServer1 RegionServer1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 197
HMaster (2)

The HMaster process manages all the RegionServers:
• Handles RegionServer failovers.
• Performs cluster operations including creating, modifying, and deleting tables.

The HMaster process migrates Regions:
• Allocates Regions when a new table is created.
• Ensures load balancing during operation.
• Takes over Regions after a RegionServer failover occurs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 198
RegionServer

• RegionServer is the data service process of HBase and is responsible for processing reading and writing requests of user data.
• RegionServer manages Regions. All reading and writing requests of user data are handled based on interaction among Regions on RegionServers.
• Regions can be migrated between RegionServers.

(Figure: a RegionServer hosting multiple Regions.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 199
Region (1)

• A data table is divided horizontally into subtables based on the KeyValue range to implement distributed storage. A subtable is called a Region in HBase.
• Each Region is associated with a KeyValue range, which is described using a StartKey and an EndKey. Each Region only needs to record a StartKey, because its EndKey serves as the StartKey of the next Region.
• Region is the most basic distributed storage unit of HBase.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 200
Region (2)

(Figure: rows Row001-Row010 map to Region-1, Row011-Row020 to Region-2, Row021-Row030 to Region-3, and Row031 onward to Region-4; each Region is described by its StartKey and EndKey.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 201
Region (3)

META Region

Region Region Region Region Region

• Regions are categorized as Meta Region and User Region.


• Meta Region records routing information of User Regions.
• Perform the following steps to access data in a Region:
Search for the address of the Meta Region.

Search for the address of the User Regions in the Meta Region.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 202
Column Family

Region Region Region Region


/HBase/table
/region-1/ColumnFamily-1
/region-1/ColumnFamily-2

/region-2/ColumnFamily-1
/region-2/ColumnFamily-2
/HBase/table
/region-1
/region-3/ColumnFamily-1
/region-2
/region-3/ColumnFamily-2
/region-3
HDFS

• A ColumnFamily is a physical storage unit of a Region. Multiple column families of the same Region have different paths in HDFS.
• ColumnFamily information is table-level configuration information. That is, multiple Regions of the same table have the same column family information (for example, each Region has two column families, and the configuration information of the same column family is identical across Regions).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 203
ZooKeeper

ZooKeeper provides the following functions for HBase:


1. Distributed lock service
• Multiple HMaster processes try to register a node in ZooKeeper, and the node can be registered by only one HMaster process. The process that successfully registers the node becomes the active HMaster process.

2. Event listening mechanism
• The active HMaster's record is deleted after the active process fails, and the standby processes receive an update message indicating that the active HMaster is down.

3. Micro database role
• ZooKeeper stores the addresses of RegionServers. In this case, ZooKeeper can be regarded as a micro database.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 204
MetaData Table
• The MetaData table HBase:meta stores the information about Regions so that the client can locate the specific Region.
• The MetaData table is divided into multiple Regions, and the metadata information of these Regions is stored in ZooKeeper.

(Figure: the MetaData table records the mapping relation between user tables 1..N and their Regions.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 205
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 206
Writing Process
(Figure: a RegionServer is like a library floor holding Region 1 General Biology, Region 2 Environmental Biology, Region 3 Palaeontology, Region 4 Physiochemistry, and Region 5 Biophysics. Books are shelved by classification: for example, Physiochemistry books (Rowkey 001, 002, ...) go to Region 4 and Palaeontology books to Region 3.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 207
Client Initiating a Data Writing Request

Client

• The process of initiating a writing request by a client is like sending books to a library
by a book supplier. The book supplier must determine to which building and floor the
books should be sent.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 208
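In code, the writing request is issued through the HBase client API. A minimal Java sketch (the table, column family, qualifier, and values are illustrative, not from this course):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBasePutDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("books"))) {
        // One Put per RowKey; the client locates the target Region through the META table.
        Put put = new Put(Bytes.toBytes("Rowkey001"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"), Bytes.toBytes("Palaeontology"));
        table.put(put);
      }
    }
  }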
Writing Process - Locating a Region
1 Hi, META. Please send the bookshelf number, book number scope
(Rowkeys included in each Region), and information about the
floors where the bookshelves are located (RegionServers to which
the Regions belong) to me.

(Figure: the META table maps each Region, the bookshelf number, to its Rowkey scope (the book number range, e.g. Rowkey 070-078 for Region 3 Palaeontology) and to the RegionServer that hosts it, the floor information.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 209
Writing Process - Grouping Data (1)

I have classified the books by


book number and I am going to
send the books to RegionServers.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 210
Writing Process - Grouping Data (2)

Data grouping includes two steps:
• Find the Region and RegionServer information of the table based on the META table.
• Dispatch each record to its specific Region according to its RowKey.

Data destined for each RegionServer is sent at the same time; at this point, the data has already been divided by Region.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 211
Writing Process - Sending a Request to a RegionServer

RS 1 / RS 2 / RS 5, I am • Data is sent using the encapsulated RPC framework of HBase.


sending the books to you.

• Operations of sending requests to multiple RegionServers are implemented concurrently.


RegionServer

• After sending a data writing request, a client waits for the request processing result.

RegionServer • If the client does not capture any exception, it deems that all data has been written successfully. If
writing the data fails completely or partially, the client can obtain a detailed KeyValue list relevant to
the failure.

RegionServer

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 212
Writing Process - Process of Writing Data to a Region
(Figure: RegionServers RS1, RS2, and RS5 each store the received KeyValues (Q1~Q3, Q4~Q5, Q6~Q7, Q8~Q10, Q11~Q12) in sequence within their Regions, like shelving books in order by book number.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 213
Writing Process - Flush

(Figure: within a Region, ColumnFamily-1 and ColumnFamily-2 each have a MemStore that is flushed to HFiles.)

A Flush operation of a MemStore is triggered in any of the following scenarios:
• The total MemStore usage of a Region reaches the predefined flush size threshold.
• The ratio of occupied memory to total memory of the RegionServer reaches the threshold.
• The number of WALs reaches the threshold.
• The MemStore is flushed periodically (every 1 hour by default).
• Users can flush a table or Region separately with a shell command (or programmatically, as sketched after this slide).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 214
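Besides the shell, a flush can also be triggered programmatically through the HBase Admin API; a minimal sketch (the table name is illustrative):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class FlushDemo {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        admin.flush(TableName.valueOf("books")); // flush all MemStores of the table to HFiles
      }
    }
  }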
Impacts of Multiple HFiles

(Figure: read latency (ms) versus load test time (sec). Latency rises steadily over a 4-hour load test as HFiles accumulate.)

As time passes by, the number of HFiles increases and a query request will take much more
time.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 215
Compaction (1)

Compaction aims to reduce the number of small files in a column family in a Region,
thereby increasing reading performance.

There are two kinds of compaction: major and minor.

• Minor: compaction covering a small range. Minimum and • Major: compaction covering the HFiles in a column family in a
maximum numbers of files are specified. Small files at a Region. During major compaction, deleted data is cleared.
consecutive time duration are combined.

Files are selected based on a certain algorithm during minor compaction.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 216
Compaction (2)

Write
put MemStore

Flush

HFile HFile HFile HFile HFile HFile HFile

Minor Compaction

HFile HFile HFile

Major Compaction

HFile

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 217
Region Split

Parent Region
• A common Region splitting operation is performed to split a Region into
two subregions if the data size of the Region exceeds the predefined
threshold.

• During splitting, the split Region suspends the reading and writing
services. During splitting, data files of the parent Region are not split
and rewritten to the two subregions. Reference files are created in the
new Region to achieve quick splitting. Therefore, services of the Region
are suspended only for a short time.

• Routing information of the parent Region cached in clients must be


updated.
DaughterRegion-1 DaughterRegion-2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 218
Reading Process
(Figure: the same library layout as in the writing process, with Regions 1-5 for the five book categories on a RegionServer floor. A read for Physiochemistry (Rowkey 001, 002, ...) goes to Region 4, and one for Palaeontology goes to Region 3.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 219
Client Initiating a Data Reading Request

Get

 When a precise key is provided, the


Get operation is performed to read a
single row of user data.

Scan

 The Scan operation is to batch scan


Client
user data of a specified Key range.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 220
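A minimal Java sketch of the two read paths (table and key names are illustrative; withStartRow/withStopRow follow the HBase 2.x-style client API):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseReadDemo {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Table table = conn.getTable(TableName.valueOf("books"))) {
        // Get: a precise key reads a single row.
        Result row = table.get(new Get(Bytes.toBytes("Rowkey001")));
        // Scan: batch-reads a key range [startRow, stopRow).
        Scan scan = new Scan().withStartRow(Bytes.toBytes("Rowkey001"))
                              .withStopRow(Bytes.toBytes("Rowkey100"));
        try (ResultScanner rs = table.getScanner(scan)) {
          for (Result r : rs) {
            // process each row here
          }
        }
      }
    }
  }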
Locating a Region
1. "Hi, META, I want to look for books whose codes range from xxx to xxx. Please find the bookshelf numbers and the floor information within that code range."

(Figure: the META table returns the Region (bookshelf number), its Rowkey scope (book number range), and the RegionServer (floor information).)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 221
OpenScanner

(Figure: a Region with ColumnFamily-1 (MemStore, HFile-11, HFile-12) and ColumnFamily-2 (MemStore, HFile-21, HFile-22).)

During the OpenScanner process, scanners corresponding to the MemStore and each HFile are created:
• The scanner corresponding to an HFile is a StoreFileScanner.
• The scanner corresponding to a MemStore is a MemStoreScanner.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 222
Filter

Filter allows users to set filtering criteria during the Scan operation. Only user data that meets the criteria is returned.

Some typical Filter types are:
• RowFilter
• SingleColumnValueFilter
• KeyOnlyFilter
• FilterList

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 223
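A sketch of attaching a filter to a Scan (HBase 2.x-style API; the family, qualifier, and value are illustrative):

  import org.apache.hadoop.hbase.CompareOperator;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FilterDemo {
    public static Scan palaeontologyScan() {
      Scan scan = new Scan();
      // Return only rows whose info:type column equals "Palaeontology".
      scan.setFilter(new SingleColumnValueFilter(
          Bytes.toBytes("info"), Bytes.toBytes("type"),
          CompareOperator.EQUAL, Bytes.toBytes("Palaeontology")));
      return scan;
    }
  }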
BloomFilter

BloomFilter is used to optimize scenarios where data is randomly read, that is,
scenarios where the Get operation is performed. It can be used to quickly check
whether a piece of user data exists in a large dataset (most data in the dataset
cannot be loaded to the memory).

A certain error rate exists when BloomFilter checks whether a piece of data exists. Nevertheless, the conclusion indicated by the message "User data XXXX does not exist" is accurate.

The data relevant to the BloomFilter of HBase is stored in HFiles.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 224
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 225
Supporting Secondary Index

• The secondary index enables HBase to query data based on specific column values.

Column Family A Column Family B


RowKey A: Name A: Addr A: Age B: Mobile B: Email

01 ZhangSan Beijing 23 6875349 ……

02 LiLei Hangzhou 43 6831475 ……

03 WangWu Shenzhen 35 6809568 ……

04 …… Wuhan 28 6812645 ……

05 …… Changsha 26 6889763 ……

06 …… Jinan 35 6854912 ……

• When the secondary index is not used, the mobile field needs to be matched in the entire table by row to search for
specified mobile numbers such as “68XXX” which results in long time delay.
• When the secondary index is used, the index table is searched first to identify the location of the mobile number,
which narrows down the search scope and reduces the time delay.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 226
HFS

• HBase FileStream (HFS) is a separate module of HBase. As an encapsulation of HBase and HDFS interfaces, HFS
provides capabilities, such as storing, reading and deleting files for upper-level applications.

• HFS provides the ability of storing massive small files and large files in HDFS. That is, massive small files (less than
10MB) and some large files (larger than 10MB) can be stored in HBase.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 227
HBase MOB (1)

MOB data (100 KB to 10 MB) is directly stored in the file system (HDFS, for example) as HFiles. The information about the address and size of each file is stored in HBase as a value. With tools managing these files, the frequency of compaction and split can be greatly reduced, and performance is improved.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 228
HBase MOB (2)

(Figure: HBase MOB architecture. HRegionServers contain MOB Stores alongside normal Stores; StoreFiles and MOB files are both stored as HFiles in HDFS.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 229
Summary
This module describes the following information about HBase: KeyValue Storage Model, technical architecture,
reading and writing process and enhanced features of FusionInsight HBase.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 230
Quiz

• Can the services of the Region in HBase be provided when splitting?


• What are the advantages of the Region splitting of HBase?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 231
Quiz

• What is Compaction used for? ( )


A. Reducing the number of files in a column family and Region.
B. Improving data reading performance.
C. Reducing the number of files in a column family.
D. Reducing the number of files in a Region.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 232
Quiz

• What is the physical storage unit of HBase? ( )


A. Region.
B. Column Family.
C. Column.
D. Cell.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 233
More Information

• Download training materials:


– http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094&courseId=Node1000011807

• eLearning course:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 234
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Hive

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Foreword

Based on Hive provided by the Hive open source community, Hive in FusionInsight HD has various enterprise-level customization features, such as Colocation table creation, column encryption, and syntax enhancement. With these features, FusionInsight HD outperforms the community version in terms of reliability, fault tolerance, scalability, and performance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 237
Objectives

Upon completion of this course, you will be able to know:
A. Hive application scenarios and basic principles
B. Enhanced features of FusionInsight Hive
C. Common Hive SQL statements
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 238
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 239
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 240
Hive Overview

Hive is a data warehouse tool running on Hadoop and supports PB-


level distributed data query and management.

Hive provides the following functions:

• Flexible ETL (extract / transform / load).

• Supporting computing engines, such as MapReduce, Tez, and Spark.

• Direct access to HDFS files and HBase.

• Easy to use and program.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 241
Application Scenarios of Hive

• Data aggregation: daily/weekly click counts, traffic statistics.
• Data mining: interest analysis, user behavior analysis, partition demonstration.
• Non-real-time analysis: log analysis, text analysis.
• Data warehouse: data extraction, data loading, data transformation.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 243
Position of Hive in FusionInsight
(Figure: position of Hive in FusionInsight. The same layered architecture as for HBase: Hive sits in the Hadoop layer on top of YARN/ZooKeeper and HDFS/HBase, below the DataFarm and application service layers, with FusionInsight Manager providing system management, service governance, and security management.)
Hive is a data warehouse tool, which employs HiveQL (SQL-like) to query


data. All Hive data is stored in HDFS.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 244
Comparison Between Hive and Traditional Data
Warehouses (1)
Storage — Hive: HDFS; theoretically, it is infinitely scalable. Traditional warehouse: a cluster of limited storage capacity whose calculation speed drops dramatically as storage grows; applicable only to commercial applications with small data volumes, not to extra-large amounts of data.

Execution engine — Hive: MapReduce/Tez/Spark. Traditional warehouse: can use more efficient algorithms and more optimization measures to query data.

Usage — Hive: HQL (similar to SQL). Traditional warehouse: SQL.

Flexibility — Hive: metadata and data are stored separately for decoupling. Traditional warehouse: low; data is used for limited purposes.

Analysis speed — Hive: depends on cluster size and is scalable; more efficient than traditional data warehouses for large amounts of data. Traditional warehouse: fast for small amounts of data, but the speed decreases dramatically as the data volume grows.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 245
Comparison Between Hive and Traditional Data
Warehouses (2)
Index — Hive: low efficiency; it has not met expectations so far. Traditional warehouse: efficient.

Usability — Hive: an application model must be developed, which brings high flexibility but low usability. Traditional warehouse: provides a set of well-developed report solutions to facilitate data analysis.

Reliability — Hive: data is stored in HDFS, which features high reliability and high fault tolerance. Traditional warehouse: relatively low reliability; a failed query must be restarted, and data fault tolerance relies on hardware RAID.

Environment dependence — Hive: can be deployed on common computers. Traditional warehouse: requires high-performance commercial servers.

Price — Hive: open-source product. Traditional warehouse: commercial data warehouses are expensive.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 246
Advantages of Hive

Advantages of Hive:
1. High reliability and tolerance: HiveServer in cluster mode, dual MetaStore, query retry after timeout.
2. SQL-like query: SQL-like syntax and a large number of built-in functions.
3. Scalability: user-defined storage formats and user-defined functions (UDF/UDAF/UDTF).
4. Multiple interfaces: Beeline, JDBC, Thrift, Python, ODBC.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 247
Disadvantages of Hive

Disadvantages of Hive:
1. High latency: MapReduce is the execution engine by default, and MapReduce latency is high.
2. No materialized views: materialized views are not supported, and data updating, insertion, and deletion cannot be performed on views.
3. Inapplicable to OLTP: column-level data adding, updating, and deletion are not supported.
4. No stored procedures: stored procedures are not supported, but logic processing can be implemented using UDFs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 248
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 249
Hive Architecture

[Figure: Hive architecture. JDBC and ODBC clients, the Web Interface, the Command Line Interface, and the Thrift Server all reach the Driver (Compiler, Optimizer, Executor), which works with the MetaStore.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 250
Hive Architecture in FusionInsight HD

Hive consists of HiveServer, MetaStore, and WebHCat, each of which can have multiple instances:
• HiveServer: receives requests from clients, parses and executes HQL commands, and returns query results.
• MetaStore: provides metadata services.
• WebHCat: provides HTTP services, such as metadata access and Data Definition Language (DDL) operations, for external users.
Hive depends on DBService, HDFS, and YARN.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 251
Architecture of WebHCat

WebHCat provides REST interfaces that allow users to perform the following operations over the secure HTTPS protocol:
• Hive DDL operations
• Running Hive HQL tasks
• Running MapReduce tasks

[Figure: the WebHCat Server (also known as Templeton) receives REST requests and dispatches them; DDL operations go to the HCat DDL server, while Pig, Hive, and MapReduce tasks are queued for execution and return a Job_ID, with results exchanged through HDFS.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 252
Data Storage Model of Hive

[Figure: Hive data storage model. A Database contains Tables; a Table can be divided into Partitions or directly into Buckets; table data can be skewed or normal; Buckets hold the actual data.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 253
Data Storage Model of Hive - Partition and
Bucket
Partition: A data table can be divided into partitions
by using a field value.
• Each partition is a directory.
• The number of partitions is configurable.
• A partition can be further partitioned or bucketed.

Bucket: Data can be stored in different buckets.


• Each bucket is a file.
• The bucket quantity is set when a table is created and data can be sorted in the
bucket.
• Data is stored in a bucket by the hash value of a field.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 254
Data Storage Model of Hive - Managed Table and
External Table
Hive can create managed tables and external tables:
• Managed tables are created by default and managed by Hive. In this case, Hive migrates data into its data warehouse directories.
• When external tables are created, Hive accesses data in locations outside the data warehouse directories.
• Use managed tables when Hive performs all operations on the data.
• Use external tables when Hive and other tools share the same data set for different processing.

Managed Table External Table


CREATE / Data is migrated to data The location of external data is specified
LOAD warehouse directories. when a table is created.

DROP Metadata and data are deleted. Only metadata is deleted.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 256
Functions of Hive

Built-in functions in Hive:


• Mathematical functions, such as round(), floor(), abs(), rand(), etc.
• Date functions, such as to_date(), month(), day(), etc.
• String functions, such as trim(), length(), substr(), etc.

UDF (User-Defined Function)
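For illustration, the sketch below shows a minimal custom UDF written in Java against the classic Hive UDF API; the class name and behavior are hypothetical examples, not part of the training material.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF that upper-cases a string column.
// Hive calls evaluate() once per input row.
public final class ToUpperUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;            // null in, null out
        }
        return new Text(input.toString().toUpperCase());
    }
}

After the class is packaged into a JAR and added with ADD JAR, it can be registered with CREATE TEMPORARY FUNCTION (using the fully qualified class name) and then called in HQL like a built-in function.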

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 258
Enhanced Features of Hive - Colocation
Overview
• Colocation: storing associated data or data on which associated operations are performed on
the same storage node.
• File-level Colocation allows quick file access. This avoids network consumption caused by
data migration.

[Figure: with one NameNode and six DataNodes, the blocks of associated files A, B, C, and D are co-located on the same DataNodes.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 259
Enhanced Features of Hive - Using Colocation

Step 1: Use an HDFS interface to create a group ID and locator IDs.

hdfs colocationadmin -createGroup -groupId groupid -locatorIds locatorid1,locatorid2,locatorid3

Step 2: Use the Hive Colocation function when creating tables.

CREATE TABLE tbl_1 (id INT, name STRING) STORED AS RCFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");

CREATE TABLE tbl_2 (id INT, name STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 260
Enhanced Features of Hive - Encrypting
Columns
Step 1: When creating a table, specify the columns to be encrypted and the encryption algorithm.

create table encode_test (id INT, name STRING, phone STRING, address STRING)
row format serde "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES(
"column.encode.columns"="phone,address",
"column.encode.classname"="org.apache.hadoop.hive.serde2.AESRewriter");

Step 2: Use the insert syntax to import data into the table whose columns are encrypted.

insert into table encode_test select id, name, phone, address from test;

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 261
Enhanced Features of Hive - Deleting HBase
Records in Batches

Overview:
• In FusionInsight HD, Hive allows deletion of a single record from an HBase table. Hive can use specific syntax to delete one or more data
records that meet criteria from its HBase tables.

Usage:
• To delete some data from an HBase table, run the following HQL statement:

remove table HBase_table where expression;

Here, expression indicates the criteria for selecting the data to be deleted.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 262
Enhanced Features of Hive - Controlling Traffic

By using the traffic control feature, you can control:


• Total number of established connections
• Number of established connections of each user
• Number of connections established within a unit period

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 263
Enhanced Features of Hive -
Specifying Row Delimiters
Step 1: Set inputFormat and outputFormat when creating a table.

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
STORED AS
inputformat "org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat"
outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

Step 2: Specify the delimiter before a query.

set hive.textinput.record.delimiter="!@!";

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 264
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 265
Hive SQL Overview

DDL-Data definition language


• Table creation, table modification and deletion, partitions, and data types.

010101010101
010101010101 DML-Data manipulation language
010101010101 • Data import, export.

DQL-Data query language


• General query.
• Complicated query, like Group by, Order by, Join, etc.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 266
Hive Basic Operations (1)
Data format example: 1,huawei,1000.0
• Create managed table.

CREATE TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

• Create external table.


CREATE EXTERNAL TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/localtest';

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 267
Hive Basic Operations (2)
• Modify a column.
ALTER TABLE employee1 CHANGE money money STRING COMMENT 'changed by alter' AFTER dateincompany;

• Add a column.
ALTER TABLE employee1 ADD columns(column1 string);

• Modify the file format.


ALTER TABLE employee3 SET fileformat TEXTFILE;

• Delete table data.


DELETE FROM table_1 WHERE column_1=??;
DROP TABLE table_a;

• Describe table.
DESC table_a;

• Show the statements for creating a table.


SHOW CREATE TABLE table_a;

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 268
Hive Basic Operations (3)
• Load data from the local.
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE
example.employee;

• Load data from another table.


INSERT INTO TABLE company.person PARTITION(century=
'21',year='2010')
SELECT id, name, age, birthday FROM company.person_tmp WHERE
century= '23' AND year='2010';
• Export data from a Hive table to HDFS.
EXPORT TABLE company.person TO '/department';

• Import data from HDFS to a Hive table.


IMPORT TABLE company.person FROM '/department';

• Insert data.
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century= '23' AND year='2010';

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 269
Hive Basic Operations (4)
• WHERE.
SELECT id, name FROM employee WHERE salary >= 10000;

• GROUP BY.
SELECT department, avg(salary) FROM employee GROUP BY department;

• UNION ALL.
SELECT id, salary, date FROM employee_a UNION ALL
SELECT id, salary, date FROM employee_b;

• JOIN.
SELECT a.salary, b.address FROM employee a JOIN employee_info
b ON a.name=b.name;

• Subquery.
SELECT a.salary, b.address FROM employee a JOIN (SELECT
address FROM employee_info where province='zhejiang') b ON
a.name=b.name;

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 270
Summary
This module describes the following information about Hive: basic principles, application scenarios, enhanced
features in FusionInsight and common Hive SQL statements.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 271
Quiz

• Which of the following scenarios does Hive apply to?


A. Real-time online data analysis.
B. Data mining (user behavior analysis, interest analysis, and partition demonstration).
C. Data aggregation (daily / weekly click count and click count rankings).
D. Non-real-time data analysis (log analysis and statistics analysis).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 272
Quiz

• Which of the following statements about Hive SQL operations are correct?

A. The keyword external is used to create an external table and the key word internal is used to create a common
table.
B. Specify the location information when creating an external table.
C. When data is uploaded to Hive, the data source must be one HDFS path.
D. When creating a table, column delimiters can be specified.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 273
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 274
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Real-time stream processing
A
System architecture of Streaming
B
Objectives
Upon completion of this course, you will be able
to know:
Key features of Streaming
C
Basic CQL concepts
D
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 277
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 278
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 279
Streaming Overview

Streaming is a distributed real-time computing framework based on open-source Storm, with the following features:
• Real-time response with low delay
• Data computing before storing
• Continuous query
• Event-driven

[Figure: event data from sources such as YouTube, Facebook, WeChat, and Weibo is processed with no waiting; results are delivered in-flight, triggering alerts and actions, and continuous queries run before data reaches storage.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 280
Application Scenarios of Streaming

Streaming is applicable to the following scenarios:
• Real-time analysis: real-time log processing and vehicle traffic analysis.
• Real-time statistics: real-time website access statistics and sorting.
• Real-time recommendation: real-time advertisement positioning and event marketing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 281
Position of Streaming in FusionInsight
Application service layer

OpenAPI / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Porter Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive M/R Spark Streaming Flink governance

Hadoop YARN / ZooKeeper LibrA


Security
management
HDFS / HBase

Streaming is a distributed real-time computing framework, widely used in


real-time business.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 282
Comparison with Spark Streaming
[Figure: Spark Streaming splits a live input stream into micro-batches (t1 to tn), generates RDDs, and starts Spark batch jobs through the task scheduler and memory manager to execute RDD transformations; Streaming processes each event as it arrives.]

Micro-batch processing by Spark Streaming versus stream processing by Streaming:
• Task execution mode: Spark Streaming starts execution logic instantly and reclaims it upon completion; Streaming starts execution logic before execution and keeps it persistent.
• Event processing mode: Spark Streaming starts processing upon accumulation of a certain number of events in a batch; Streaming processes events in real time.
• Delay: Spark Streaming is second-level; Streaming is millisecond-level.
• Throughput: Spark Streaming is high (2 to 5 times that of Streaming); Streaming is average.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 283
Comparison of Application Scenario

[Figure: on a real-time performance axis, Streaming responds in milliseconds while Spark Streaming responds in seconds.]
• Streaming applies to delay-sensitive services.
• Spark Streaming applies to delay-insensitive services.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 284
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 285
Basic Concepts (1)

Topology
A real-time application in Streaming.
Nimbus
Assigns resources and schedules tasks.
Supervisor
Receives tasks assigned by Nimbus, and starts/stops Worker processes.

Worker
Runs component logic processes.
Spout
Generates source data flows in a topology.
Bolt
Receives and processes data in a topology.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 286
Basic Concepts (2)

Task
A Spout or Bolt thread running in a Worker.

Tuple
The core data structure of Streaming and the basic message delivery unit, organized as key-value pairs; tuples can be created and processed in a distributed way.

Stream
An infinite continuous Tuple sequence.

ZooKeeper
Provides distributed collaboration services for processes. Active / Standby Nimbus,
Supervisor, and Worker register their information in ZooKeeper. This enables Nimbus to
detect the health status of all roles.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 288
System Architecture

[Figure: the Client submits a topology to Nimbus. Nimbus monitors heartbeats and assigns tasks through the ZooKeeper cluster. Each Supervisor downloads the JAR package, obtains its tasks from ZooKeeper, and starts Workers. Workers run Executors and report heartbeats.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 289
Topology
• A topology is a directed acyclic graph (DAG) consisting of Spout (data source) and Bolt (for logical
processing). Spout and Bolt are connected through Stream Groupings.
• Service processing logic is encapsulated in topologies in Streaming.

[Figure: a Spout obtains stream data from external data sources and feeds Bolts A and B; Bolt A filters data, and Bolt C triggers external messages and performs persistent archiving.]
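The sketch below shows how such a topology could be wired together with the Java API; the spout and bolt classes (SentenceSpout, SplitBolt, CountBolt) are hypothetical placeholders for user-defined components.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout: generates the source data flow of the topology.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // Bolts: receive and process data; the grouping defines how
        // tuples are delivered from the upstream component.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(3);   // number of Worker processes (JVMs)
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}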

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 290
Worker

Worker
A Worker is a JVM process, and a topology runs in one or more Workers. A started Worker runs continuously unless manually stopped. The number of Worker processes depends on the topology setting and has no upper limit; the number of Worker processes that can actually be scheduled and started depends on the number of slots configured in the Supervisors.

Executor
A Worker process runs one or more Executor threads. Each Executor can run one or more task instances of either a Spout or a Bolt.

Task
A unit that processes data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 291
Task

Both Spouts and Bolts in a topology support concurrent running. In the topology, you can specify the number of concurrently running tasks for each node. Streaming assigns the tasks across the cluster to enable simultaneous computation and enhance the processing capability of the system.

[Figure: a Spout delivers tuples to Bolts A, B, and C through stream groupings, each component running multiple parallel tasks.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 292
Message Delivery Policies

• fieldsGrouping (field grouping): delivers messages in groups to tasks of the target Bolt according to message hash values.
• globalGrouping (global grouping): delivers all messages to one fixed task of the target Bolt.
• shuffleGrouping (shuffle grouping): delivers messages to a random task of the target Bolt.
• localOrShuffleGrouping (local or shuffle grouping): delivers messages randomly to tasks of the target Bolt in the same process if any exist; otherwise delivers messages in shuffle grouping mode.
• allGrouping (broadcast grouping): delivers messages to all tasks of the target Bolt.
• directGrouping (direct grouping): delivers messages to the task of the target Bolt specified by the data producer; the task ID is specified using the emitDirect(taskId, tuple) interface.
• partialKeyGrouping (partial field grouping): load-balanced field grouping.
• noneGrouping (no grouping): same as shuffle grouping.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 293
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 295
Nimbus HA

ZooKeeper cluster

Streaming cluster

Active Nimbus Standby Nimbus

Supervisor Supervisor Supervisor



worker worker worker worker worker worker

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 296
Disaster Recovery

Services are automatically migrated from faulty nodes to normal ones, preventing service interruptions.

[Figure: before failover, topologies Topo1 to Topo4 run across Node1 to Node3; after a node fails, its workers are re-assigned to the remaining nodes with zero manual operation.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 297
Message Reliability
• At Most Once, processing mechanism: none. This mode has the highest throughput and applies to messages with low reliability requirements.
• At Least Once, processing mechanism: Ack. This mode has lower throughput and applies to messages with high reliability requirements, where all data must be completely processed.
• Exactly Once, processing mechanism: Trident, a special transactional API provided by Storm; this mode has the lowest throughput.

When a tuple is completely processed in Streaming, the tuple and all of its derived tuples have been successfully processed. A tuple fails to be processed if the processing is not complete within the timeout period.

[Figure: a tuple tree; root tuple A derives tuples B and C, which in turn derive D and E.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 298
Ack Mechanism

• When a Spout sends a tuple, it notifies the Acker that a new root message has been generated; the Acker creates a tuple tree and initializes its checksum to 0.
• When a Bolt sends a message, it sends an anchor tuple to the Acker to refresh the tuple tree and reports the result after the message is sent. If the message is sent successfully, the Acker refreshes the checksum; if it fails, the Acker immediately notifies the Spout of the failure.
• When a tuple tree is completely processed (checksum = 0), the Acker notifies the Spout of the result.
• The Spout provides ack() and fail() functions to process Acker results; the fail() function implements the message-resending logic, as in the sketch below.

[Figure: a Spout feeds Bolt1 to Bolt4; each hop reports Ack1 to Ack6 back through the Acker.]
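A minimal sketch of these callbacks on a custom spout follows; the message-ID bookkeeping (the pending map) is an illustrative assumption, not a fixed Storm structure.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Values;

public abstract class ReliableSpout extends BaseRichSpout {
    // Tuples awaiting acknowledgement, keyed by message ID (illustrative).
    private final Map<Object, Values> pending = new ConcurrentHashMap<>();
    protected SpoutOutputCollector collector;  // assigned in open() by subclasses

    protected void emitReliably(Object msgId, Values tuple) {
        pending.put(msgId, tuple);
        collector.emit(tuple, msgId);   // the message ID enables Acker tracking
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);          // tuple tree fully processed
    }

    @Override
    public void fail(Object msgId) {
        Values tuple = pending.get(msgId);
        if (tuple != null) {
            collector.emit(tuple, msgId);  // resend on failure or timeout
        }
    }
}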

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 299
Reliability Level Setting

If not every message is required to be processed (some message loss is allowed), the reliability mechanism can be disabled to ensure better performance, in the following ways (see the sketch below):
• Setting Config.TOPOLOGY_ACKERS to 0.
• Using the Spout to send messages through interfaces that do not carry message IDs.
• Using the Bolt to send messages in unanchored mode.
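For example, the first option is a one-line configuration change (a sketch):

import org.apache.storm.Config;

public class NoAckConfig {
    public static Config build() {
        Config conf = new Config();
        // Setting the acker count to 0 disables ack tracking entirely,
        // trading reliability (at-most-once semantics) for throughput.
        conf.setNumAckers(0);
        return conf;
    }
}

On the Spout side, calling collector.emit(tuple) without a message ID leaves the tuple untracked; on the Bolt side, calling collector.emit(values) without the input tuple as an anchor (unanchored mode) cuts the tuple tree at that point.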

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 300
Streaming and Other Components

[Figure: Kafka topics (Topic1 to Topic N) feed Streaming topologies (Topology1 to Topology N), which write results to HDFS, Redis, HBase, Kafka, and other systems.]

External components such as HDFS and HBase are integrated to facilitate both real-time and offline analysis.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 301
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 302
StreamCQL Overview

StreamCQL (Stream Continuous Query Language) is a query language used on distributed stream processing platforms and can be built on top of various stream processing engines (mainly Apache Storm).

Currently, most stream processing platforms provide only distributed processing capabilities but involve complex
service logic development and poor stream computing capabilities. The development efficiency is low due to low reuse
and repeated development. StreamCQL provides various distributed stream computing functions, including traditional
SQL functions such as filtering and conversion, and new functions such as stream-based time window computing,
window data statistics, and stream data splitting and merging.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 303
StreamCQL Easy to Develop
Native Storm API (Java):

//Define input:
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {…}
public void nextTuple() {…}
public void ack(Object id) {…}
public void declareOutputFields(OutputFieldsDeclarer declarer) {…}

//Define logic:
public void execute(Tuple tuple, BasicOutputCollector collector) {…}
public void declareOutputFields(OutputFieldsDeclarer ofd) {…}

//Define output:
public void execute(Tuple tuple, BasicOutputCollector collector) {…}
public void declareOutputFields(OutputFieldsDeclarer ofd) {…}

//Define topology:
public static void main(String[] args) throws Exception {…}

StreamCQL:

--Define input:
CREATE INPUT STREAM S1 …

--Define logic:
INSERT INTO STREAM filterstr SELECT * FROM S1 WHERE name="HUAWEI";

--Define output:
CREATE OUTPUT STREAM S2 …

--Define topology:
SUBMIT APPLICATION test;


Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 304
StreamCQL and Stream Processing Platform

• Service interface: CQL and IDE.
• Functions: Join, Aggregate, Split, Merge, Pattern Matching, Stream, and Window.
• Engine: Storm and other stream processing engines.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 305
Summary
This module describes the following information about Streaming:
• Definition
• Application Scenarios
• Position of Streaming in FusionInsight
• System architecture of Streaming
• Key features of Streaming
• Introduction to StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 306
Quiz


How is message reliability guaranteed in
Streaming?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 307
Quiz

• Which of the following statements about Streaming fault tolerance are CORRECT?

A. Nimbus HA supports hot failover and eliminates single points of failure.


B. Supervisor faults can be automatically recovered without affecting running services.
C. Worker faults can be automatically recovered.
D. Tasks on a faulty node of the cluster will be re-assigned to other normal nodes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 308
Quiz

• Which of the following statements about Supervisor is CORRECT?

A. Supervisor assigns resources and schedules tasks.


B. Supervisor receives tasks assigned by Nimbus, and starts / stops Worker processes.
C. Supervisor runs component logic processes.
D. Supervisor receives and processes data in a topology.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 309
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 310
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical principles
of Flink A
Objectives
After completing this course, you will be able to
Key features of Flink
B
understand:
Flink integration in FusionInsight HD
C
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 313
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 314
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 315
Flink Overview

• Flink is a unified computing framework that supports both batch


processing and stream processing. It provides a streaming data
processing engine that supports data distribution and parallel
computing. Flink features stream processing, and is a top open-source
stream processing engine in the industry.

• Flink, similar to Storm, is an event-driven real-time streaming system.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 316
Key Features of Flink

Flink

Streaming-first Fault-tolerant Scalable Excellent performance

Stream Reliability and Scaling out to High throughput


processing engine checkpoint mechanism over 1000 nodes and low latency

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 317
Key Features of Flink

Low Latency Exactly Once HA Scale-out

Millisecond-level Asynchronous snapshot Active / standby Manual scale-out


processing capability. mechanism, ensuring JobManagers, preventing supported by
that all data is processed single points of failure TaskManagers.
only once. (SPOFs).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 318
Application Scenarios of Flink

Flink provides high-concurrency data processing,


millisecond-level latency, and high reliability, making it
extremely suitable for low-latency data processing
scenarios.

Typical scenarios:
• Internet finance services.
• Clickstream log processing.
• Public opinion monitoring.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 319
Hadoop Compatibility

[Figure: a dataflow graph mixing Hadoop map and reduce operators with Flink join operations.]

Flink supports YARN and can obtain data from the Hadoop distributed file system (HDFS) and HBase.

Flink supports all formatted input and output of Hadoop.

Flink supports the Mappers and Reducers of Hadoop, which can be used together with Flink operations.

Flink can run Hadoop jobs faster.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 320
Performance Comparison of Stream Computing
Frameworks
Storm & Flink Identity Single-Thread Throughput
[Bar chart, throughput in pieces/s: with a 1-partition source, 76,519.48 versus 87,729.76; with an 8-partition source, 277,792.60 versus 350,466.22, with Flink delivering the higher throughput in each group.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 321
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 322
Flink Technology Stack

• APIs & Libraries: FlinkML (machine learning), Gelly (graph processing), CEP (complex event processing), and Table (relational), built on the DataStream API (stream processing) and the DataSet API (batch processing).
• Core: the runtime, a distributed streaming dataflow engine.
• Deploy: Local (single JVM), Cluster (Standalone, YARN), and Cloud (GCE, EC2).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 323
Core Concept of Flink - DataStream

DataStream: Flink uses the DataStream class to represent streaming data in applications. A DataStream can be considered an immutable collection that may contain duplicate elements; the number of elements in a DataStream is unlimited.

[Figure: stream-type transformations. connect(DataStream) yields ConnectedStreams; join(DataStream) and coGroup(DataStream) yield JoinedStreams and CoGroupedStreams, consumed through window(…).apply(…); keyBy() yields a KeyedStream, on which window() yields a WindowedStream, while windowAll() yields an AllWindowedStream; operators such as map(), flatMap(), reduce(), fold(), sum(), and max() return DataStreams.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 324
DataStream

Data source: indicates the streaming data source, which can be HDFS files, Kafka
data, or texts.

Transformations: indicates streaming data conversion.

Data sink: indicates data output, which can be HDFS files, Kafka data, or texts.

Data Source Transformations Data Sink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 325
Data Source of Flink

Batch processing:
• Files: HDFS, local file system, MapR file system; Text, CSV, Avro, and Hadoop input formats
• JDBC
• HBase
• Collections

Stream processing:
• Files
• Socket streams
• Kafka
• RabbitMQ
• Flume
• Collections
• Implement your own: SourceFunction.collect
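A sketch of creating streams from a few of these sources with the Java DataStream API; the file path, host, and port are placeholders.

import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // File source: reads a text file line by line (the HDFS path is a placeholder).
        DataStream<String> fromFile = env.readTextFile("hdfs:///tmp/input.txt");

        // Socket stream: reads newline-delimited text from a TCP socket.
        DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

        // Collection source: convenient for testing.
        DataStream<String> fromCollection = env.fromCollection(Arrays.asList("a", "b", "c"));

        fromFile.print();
        env.execute("source-examples");
    }
}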

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 326
DataStream Transformations

Common
transformations

public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper)


public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper)
public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter)
public KeyedStream<T, Tuple> keyBy(int... fields)
public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, int field)
public DataStream<T> rebalance()
public DataStream<T> shuffle()
public DataStream<T> broadcast()
public <R extends Tuple> SingleOutputStreamOperator<R> project(int... fieldIndexes)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 327
DataStream Transformations

[Figure: an example pipeline. (1) textFile reads from HDFS, (2) map, (3) flatMap, (4) keyBy, (5) window / join, and (6) writeAsText writes results back to HDFS.]
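A sketch of the same six-stage pipeline in the Java DataStream API; the HDFS paths are placeholders.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class PipelineExample {

    // Splits each line into (word, 1) pairs.
    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);   // each operator runs as two parallel subtasks

        DataStream<Tuple2<String, Integer>> counts = env
            .readTextFile("hdfs:///tmp/in")          // (1) textFile
            .map(String::toLowerCase)                // (2) map
            .flatMap(new Tokenizer())                // (3) flatMap
            .keyBy(0)                                // (4) keyBy
            .timeWindow(Time.seconds(10))            // (5) window
            .sum(1);

        counts.writeAsText("hdfs:///tmp/out");       // (6) writeAsText
        env.execute("pipeline-example");
    }
}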

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 328
Flink Application Running Process - Key Roles

• Indicates the request initiator, which submits application requests and creates the
Client data flow.

• Manages the resources for applications. JobManager applies to ResourceManager


JobManager for resources based on the requirements of applications.

ResourceManager • Indicates the resource management department, which schedules and allocates the
of YARN resources of the entire cluster in a unified manner.

• Performs computing work. An application will be split and assigned to multiple


TaskManager TaskManagers for computing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 329
Flink Job Running Process
[Figure: the Flink program's code is translated by the optimizer / graph builder in the client into a dataflow graph, which is submitted to the JobManager (master / YARN application master). The JobManager's actor system, scheduler, and checkpoint coordinator deploy, stop, and cancel tasks, trigger checkpoints, and send status updates, statistics, and results back to the client. TaskManagers (workers) run tasks in task slots, each with its own memory & I/O manager and network manager, exchange data streams with one another, and report task status and heartbeats to the JobManager.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 330
Flink on YARN

1. The client stores the uber JAR and configuration in HDFS.
2. The YARN client registers resources with the YARN ResourceManager and requests an ApplicationMaster container.
3. The ResourceManager allocates the ApplicationMaster container, which bootstraps the Flink JobManager together with the YARN ApplicationMaster.
4. The ApplicationMaster allocates worker containers; each container is bootstrapped with the uber JAR and configuration and runs a Flink TaskManager.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 331
Technical Principles of Flink (1)

• A Flink application consists of streaming data and transformation operators.


• Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operator
that takes one or more streams as input, and produces one or more output streams as a result.

Stream
Transformation

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 332
Technical Principles of Flink (2)

The source operator loads streaming data. Transformation operators, such as map(), keyBy(), and apply(), process the streaming data. After the streaming data is processed, the sink operator writes it to storage systems such as HDFS, HBase, and Kafka.

[Figure: streaming dataflow. Source operator, then transformation operators (map(), keyBy(), apply()), then Sink operator.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 333
Parallel DataStream of Flink
[Figure: in the condensed view, the dataflow is Source, map(), keyBy()/apply(), Sink. In the parallelized view, the source, map(), and keyBy()/apply() operators each run with parallelism 2, as two parallel operator subtasks connected by stream partitions, while the sink runs with parallelism 1.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 334
Operator Chain of Flink
[Figure: in the condensed view, the dataflow is Source, map(), keyBy()/apply(), Sink. In the parallelized view, the source and map() operators are chained into one task, so each subtask (thread) of that task executes both operators; keyBy()/apply() and the sink form separate tasks. With parallelism 2, the chained source+map task and the keyBy()/apply() task each run two subtasks, and the sink runs one subtask.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 335
Windows of Flink

Flink supports operations based on time windows and operations based on data (count) windows.
• Categorized by splitting standard: time windows and count windows.
• Categorized by window action: tumbling windows, sliding windows, session windows, and custom windows.

[Figure: an event stream split into time windows and count(3) windows.]
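For instance, with the Java API a keyed stream can be windowed by time or by element count; a sketch, assuming a keyed stream of (word, count) pairs as in the earlier pipeline example:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExamples {
    public static void demo(KeyedStream<Tuple2<String, Integer>, Tuple> keyed) {
        // Tumbling time window: non-overlapping 10-second windows.
        keyed.timeWindow(Time.seconds(10)).sum(1);
        // Sliding time window: 1-minute windows evaluated every 10 seconds.
        keyed.timeWindow(Time.minutes(1), Time.seconds(10)).sum(1);
        // Count window: split by element count, size 100, slide 10.
        keyed.countWindow(100, 10).sum(1);
    }
}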

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 336
Common Window Types of Flink (1)

Tumbling windows, whose time spans do not overlap.

[Figure: per-user event streams are cut into consecutive windows 1 to 5 of a fixed window size along the time axis.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 337
Common Window Types of Flink (2)

Sliding windows, whose time spans overlap.

[Figure: windows 1 to 4 of a fixed window size overlap as they slide along the time axis.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 338
Common Window Types of Flink (3)

Session windows, which are considered complete if no data arrives within the preset gap.

[Figure: per-user event streams form windows separated by session gaps along the time axis.]
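A sketch of a session window in the Java API; the 10-minute gap and the keyed stream are illustrative assumptions.

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowExample {
    public static void demo(KeyedStream<Tuple2<String, Integer>, Tuple> keyed) {
        // The window closes once no element arrives for the configured gap.
        keyed.window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
             .sum(1);
    }
}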

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 339
Fault Tolerance of Flink
The checkpoint mechanism is a key fault tolerance measure of Flink.

The checkpoint mechanism keeps creating status snapshots of stream applications. The status
snapshots of the stream applications are stored at a configurable place (for example, in the memory
of JobManager or on HDFS).

The core of the distributed snapshot mechanism of Flink is the barrier. Barriers are periodically inserted
into data streams and flow as part of the data streams.

[Figure: barriers flow within the DataStream between newer and older tuples; records after checkpoint barrier n are part of checkpoint n+1, records between barriers n and n-1 are part of checkpoint n, and earlier records are part of checkpoint n-1.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 340
Checkpoint Mechanism (1)

• The checkpoint mechanism is the cornerstone of Flink's reliability. When an exception occurs on an operator in the Flink cluster (for example, an unexpected exit), the checkpoint mechanism can restore all application states to a consistent earlier point in time.

• This mechanism ensures that when a running application fails,


all statuses of the application can be restored from a
checkpoint so that data is processed only once. Alternatively,
you can choose to process data at least once.
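Enabling checkpointing is a per-application setting; a sketch (the 10-second interval is an arbitrary example):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a state snapshot of the application every 10 seconds.
        env.enableCheckpointing(10000);
        // EXACTLY_ONCE is the default; AT_LEAST_ONCE relaxes the guarantee
        // in exchange for lower latency during barrier alignment.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        return env;
    }
}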

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 341
Checkpoint Mechanism (2)
[Figure: the CheckpointCoordinator injects a barrier at the source operator; as the barrier passes through the source, intermediate, and sink operators, each operator takes a snapshot of its state and acknowledges it to the CheckpointCoordinator.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 342
Checkpoint Mechanism (3)
[Figure: with two sources A and B, operator C aligns the barriers; it waits until the barriers from both sources have arrived, takes its snapshot, and then emits a merged barrier downstream to operator D.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 343
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture of FusionInsight HD
Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 344
Location of Flink in FusionInsight Products
Application service layer

Open API / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Porter Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive MapReduce Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase

• FusionInsight HD provides a Big Data processing environment and selects the best practice in the industry
based on scenarios and open source software enhancement.
• Flink is a unified computing framework that supports both batch processing and stream processing. Flink
provides high-concurrency pipeline data processing, millisecond-level latency, and high reliability.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 345
Flink Web UI

The FusionInsight HD platform provides a visual management and monitoring UI for Flink. You can use the YARN Web UI to query the running status of Flink tasks.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 346
Interaction of Flink with Other Components

In the FusionInsight HD cluster, Flink interacts


with the following components:

HDFS
• (mandatory) Flink reads and writes data in HDFS.

YARN
• (mandatory) Flink relies on YARN to schedule and manage
resources for running tasks.

ZooKeeper
• (mandatory) Flink relies on ZooKeeper to implement the
checkpoint mechanism.
Kafka
• (optional) Flink can receive data streams sent from Kafka.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 347
Summary
• These slides describe the following information about Flink: basic concepts, application scenarios,
technical architecture, window types, and Flink on YARN.

• These slides also describe Flink integration in FusionInsight HD.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 348
Quiz

• What are the key features of Flink?


• What are the common window types of Flink?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 349
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 350
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Loader

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


What Loader is A
What Loader can be used for B
Position of Loader in FusionInsight C
System architecture of Loader D
Objectives
Upon completion of this course, you will be able Main features of Loader E
to know:
How to manage Loader jobs F
How to monitor Loader jobs G
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 353
CONTENTS
01 02
Introduction to Loader Loader Job Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 354
01 02
Introduction to Loader Loader Job Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 355
What Is Loader

• Loader is a loading tool for data and file exchange between FusionInsight HD and relational databases or file systems. Loader provides a wizard-based job configuration and management Web UI and supports scheduled tasks for running Loader jobs periodically. On the Web UI, users can specify multiple data sources, configure data cleaning and conversion steps, and set the cluster storage system.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 356
Application Scenarios of Loader

RDB

Hadoop
SFTP Server
Loader • HDFS

FTP Server
• HBase
• Hive

Customized
Data Source

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 357
Position of Loader in FusionInsight

Application service layer


Open API / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Loader Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase

Loader is a loading tool for data and file exchange between FusionInsight HD and
relational databases and file systems.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 358
Features of Loader

Loader features:
• GUI: provides a graphical user interface that facilitates operations.
• Security: Kerberos authentication.
• High reliability: Loader Servers are deployed in active / standby mode; jobs are executed through MapReduce and support retry after failure; no junk data is left after a job failure.
• High performance: uses MapReduce for parallel data processing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 359
Module Architecture of Loader

Loader External Data Source


Loader Client
Tool Web UI JDBC File

REST API JDBC SFTP / FTP

Transform Engine
Job
Execution Engine
Scheduler
Submission Engine Yarn Map Task

Job Manager
HBase
Metadata Repository
HDFS Reduce Task
HA Manager
Hive
Loader Server

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 360
Module Architecture of Loader - Module Description
Module Description
Loader Client Provides a web user interface (Web UI) and a command-line interface (CLI).

Processes operation requests sent from the client, manages connectors and
Loader Server
metadata, submits MapReduce jobs, and monitors MapReduce job status.
REST API Provides a RESTful (Representational State Transfer) interface (HTTP + JSON) to process operation requests from the client.
Job Scheduler Periodically executes Loader jobs.
A data transformation engine that supports field combination, string cutting, and string
Transform Engine
reverse.
Execution Engine Executes Loader jobs in MapReduce manner.
Submission Engine Submits Loader jobs to MapReduce.
Manages Loader jobs, including creating, querying, updating, deleting, activating /
Job Manager
deactivating, starting and stopping jobs.
Metadata warehouse, which stores and manages connectors, conversion steps, and
Metadata Repository
Loader jobs.
Manages the standby and active status of Loader Servers. Two Loader Servers are
HA Manager
deployed in active / standby mode.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 361
01 02
Introduction to Loader Loader Job Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 362
Service Status Web UI of Loader
• Choose Services > Loader to go to the Loader Status page.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 363
Job Management Web UI of Loader

• On the Loader Status page, click Loader Server (Active)


to go to the job management Web UI of Loader.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 364
Job Management Web UI of Loader - Job

• A job describes the process of extracting,


transforming, and loading data from the
data source to the target end. It includes
data source location and attributes, rules for
source-to-target data conversion, and target
end attributes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 365
Job Management Web UI of Loader - Job
Conversion Rules
Loader Conversion Operators:

• Long Date Conversion: performs long integer and date conversion.

• If Null: converts null values into specified values.

• Add Constants: generates constant fields.

• Generate Random: generates random value fields.

• Concatenate Fields: concatenates existing fields to generate new fields.

• Extracts Fields: separates an existing field by using specified delimiters to generate new fields.

• Modulo Integer: performs modulo operations on existing fields to generate new fields.

• String Cut: cuts existing string fields by the specified start position and end position to generate new fields.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 366
Creating a Loader Job - Basic Information

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 367
Creating a Loader Job - From

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 368
Creating a Loader Job - Transform

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 369
Creating a Loader Job - To

* Storage type HDFS

* File type TEXT_FILE

Compression format Choose…

* Output directory /user/test

File operate type OVERRIDE


Extractor Extractor size

* Number 2

Back Save Save and run Cancel

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 370
Monitoring Job Execution Status

Check the execution status of all jobs:

• Go to the Loader job management page.

• The page displays all current jobs and last execution status.

• Select a job, and click a button in the Operation column to perform a


corresponding operation.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 371
Monitoring Job Execution Status - Job Execution
History

View execution records of specified jobs:


• Select a job, and then click the History button in the Operation column.

• The historical record page displays the start time, duration (s), status, failure cause,
number of read / written / skipped rows / files, dirty data link, and MapReduce log
link of each execution.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 372
Monitoring Job Execution Status - Dirty Data
Dirty data refers to data that does not meet the Loader conversion rules; it can be checked as follows.

• If the number of skipped job records is not 0 on the job history page, click the dirty data button to go to the
dirty data directory.
• Dirty data is stored in HDFS, and the dirty data generated by each Map job is stored in a separate file.

Permission Owner Group Size Last Modified

drwx------ admin hadoop 0B Thu Apr 07 14:13:03 2016

drwx------ admin hadoop 0B Thu Apr 07 14:13:14 2016

drwx------ admin hadoop 0B Thu Apr 07 14:13:15 2016

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 373
Monitoring Job Execution Status - MapReduce Log

• On the job history page, click the log button.


The MapReduce log page for the execution
is displayed.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 374
Monitoring Job Execution Status - Job Execution
Failure Alarm

When a job fails to be executed, an alarm


is reported.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 375
Summary
This module describes the following information about Loader: main functions and features,
job management and monitoring.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 376
Quiz

• True or False:
A. FusionInsight Loader supports only data import and export between relational databases and
Hadoop HDFS and HBase.
B. Conversion steps must be configured for Loader jobs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 377
Quiz

• Which of the following statements are CORRECT?


A. No residual original files are left when a job fails after proper running for some time.

B. Dirty data refers to the data that does not comply with conversion rules.

C. Loader client scripts can only be used to submit jobs.


D. A human-machine account can be used to perform operations on all Loader jobs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 378
Quiz

• Which of the following statements is CORRECT?


A. If Loader is faulty after it submits a job to MapReduce, the job will fail to be executed.

B. If a Mapper execution fails after Loader submits a job to MapReduce, a second execution is automatically performed.

C. Residual data generated after a Loader job fails to be executed needs to be manually cleared.

D. After Loader submits a job to MapReduce for execution, it cannot submit other jobs before the execution is complete.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 379
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 380
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Flume

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Foreword

• Flume is an open-source log system: a distributed, reliable, and highly available massive log aggregation system. Flume supports customization of data senders and receivers for collecting, processing, and transferring data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 383
What Flume is A
Functions of Flume B
Position of Flume
in FusionInsight C
Objectives System architecture
of Flume D
Upon completion of this course, you will be able
to know: Key characteristics
of Flume E
Application Examples
of Flume F

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 384
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 385
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 386
What is Flume

Flume is a streaming log collection tool. It can perform simple processing on data and write the data to customizable data receivers. Flume can collect data from various data sources such as local files (spooling directory source), real-time logs (taildir and exec sources), REST messages, Thrift, Avro, syslog, and Kafka.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 387
Functions of Flume

Flume can collect logs from a specified directory and save the logs in a
01 specified path (HDFS, HBase, and Kafka).

02 Flume can collect and save logs (taildir) to a specified path in real time.

Flume supports the cascading mode (multiple Flume nodes interwork with
03 each other) and data aggregation.

04 Flume supports customized data collection.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 388
Position of Flume in FusionInsight
Application service layer

Open API / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Flume Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase

Flume is a distributed framework for collecting and aggregating


stream data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 389
Architecture of Flume (1)
• Basic Flume architecture: Flume can directly collect data on a single node. This architecture is mainly applicable to
data collection within a cluster.

Log Source Channel Sink HDFS

• Multi-agent architecture of Flume: multiple Flume agents can be cascaded. After collecting data from the original data sources, Flume saves the data in the final storage system. This architecture is mainly applicable to importing data from outside the cluster into the cluster.

Source Source

Sink Sink
Log HDFS
Channel Channel

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 390
Architecture of Flume (2)

(Flume agent internals: events from a Source pass through an Interceptor to the Channel Processor, whose Channel Selector decides which Channel or Channels receive the events; on the other side, a Sink Runner drives the Sink Processor, which takes events from a Channel and hands them to a Sink.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 391
Basic Concept - Source (1)

A source receives or generates events based on special mechanisms and saves the events to one or more channels in batches. Sources are classified into event-driven sources and event-polling sources.

• Event-driven source: The external source actively sends data to Flume, driving Flume to receive the data.
• Event-polling source: Flume actively polls the data source for data at regular intervals.

The source must be associated with at least one channel.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 392
Basic Concept - Source (2)

Source Type / Description:

• exec source: Runs a command or script and uses the execution output as a data source.
• avro source: Provides an Avro-based server bound to a port, waiting for data sent from Avro clients.
• thrift source: The same as the avro source, but the transmission protocol is Thrift.
• http source: Supports data transmission based on HTTP POST.
• syslog source: Collects syslog logs.
• spooling directory source: Collects local static files.
• jms source: Obtains data from a message queue.
• kafka source: Obtains data from Kafka.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 393
Basic Concept - Channel (1)

The channel is located between the source and the sink and functions similarly to a queue: it temporarily stores events. When the sink successfully sends events to the channel of the next hop or to the destination, the events are removed from the current channel.

The persistence levels vary with channels.


• Memory channel: The persistence is not supported.
• File channel: The persistence is achieved based on the Write-Ahead Log (WAL).
• JDBC channel: The persistence is achieved based on the embedded database.

Channels support transactions and provide weak ordering guarantees. A channel can connect to any number of sources and sinks.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 394
Basic Concept - Channel (2)

Memory channel
• Messages are stored in memory. This channel provides high throughput but no reliability; data may be lost.

File channel
• It supports data persistence, but the configuration is complex: both a data directory and a checkpoint directory must be configured, and a separate checkpoint directory is needed for each file channel.

JDBC channel
• It uses the embedded Derby database. It supports event persistence with high reliability and can replace the file channel when persistence is required.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 395
Basic Concept - Sink (1)

• The sink transmits events to the next hop or


destination. After the events are successfully
transmitted, they are removed from the current
channel.
• The sink must be bound to a specific channel.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 396
Basic Concept - Sink (2)

Sink Type / Description:

• hdfs sink: Writes data to HDFS.
• avro sink: Transmits data to the next-hop Flume node using the Avro protocol.
• thrift sink: The same as the avro sink, but the transmission protocol is Thrift.
• file roll sink: Saves data to the local file system.
• hbase sink: Writes data to HBase.
• kafka sink: Writes data to Kafka.
• MorphlineSolr sink: Writes data to Solr.
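As a sketch, the three concepts are wired together in an agent configuration file as follows; the component names (a1, ch1, s1), the command, and the HDFS path are illustrative only:

server.sources = a1
server.channels = ch1
server.sinks = s1
# exec source: use the output of a command as the data source.
server.sources.a1.type = exec
server.sources.a1.command = tail -F /var/log/test.log
server.sources.a1.channels = ch1
# memory channel: high throughput, no persistence.
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
# hdfs sink: write events to HDFS; a sink binds to exactly one channel.
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_test
server.sinks.s1.channel = ch1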

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 397
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 398
Log Collection

• Flume can collect logs from outside a cluster and archive them in HDFS, HBase, or Kafka for data analysis and cleaning by upper-layer applications.

Log Source Channel Sink HDFS

Log Source Channel Sink HBase

Log Source Channel Sink Kafka

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 399
Multi - level Cascading and Multi - channel Duplication

• Multiple Flume nodes can be cascaded. The cascaded nodes support internal
data duplication.

Source
Channel
Log
Sink

Channel Sink HDFS


Source
Channel Sink HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 400
Message Compression and Encryption by
Cascaded Flume Nodes
• Data transmitted between cascaded Flume nodes can be compressed and encrypted,
thereby improving the data transmission efficiency and security.

(Diagram: between cascaded Flume nodes, data sent over RPC is compressed and encrypted by the sending node and decompressed and decrypted by the receiving node before being written through the Flume API to HDFS / Hive / HBase / Kafka.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 401
Data Monitoring

FusionInsight
Flume monitoring information
Manager

Flume
Application
Received data size Transmitted data size

Source Data buffer size Sink HDFS / Hive /


Flume API HBase / Kafka
Channel

Transmitted data size

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 402
Transmission Reliability

• Flume adopts a transactional model for data transmission, which ensures data security and enhances reliability during transmission. In addition, if the file channel is used, data buffered in the channel is not lost when a process or node restarts.

Channel Sink Source Channel

Start tx
Take events
Send events
Start tx

Put events

End tx End tx

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 403
Transmission Reliability (Failover)

• During data transmission, if the next-hop Flume node is faulty or receives data abnormally, the
data is automatically switched over to another path.

Source Sink

HDFS
Source
Channel
Sink
Log
Channel
Source Sink
Sink
HDFS
Channel

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 404
Data Filtering During Transmission

• During data transmission, Flume can roughly filter and clean data, discarding what is unnecessary. If more complex filtering is required, you can develop filter plug-ins tailored to the data; Flume also supports third-party filter plug-ins. (A configuration sketch follows the diagram below.)

Interceptor

Channel

events events
Channel Channel
Source
Processor Selector events

Channel
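For example, the built-in regex filtering interceptor of open-source Flume can be attached to a source to drop matching events; a minimal sketch (agent and component names are illustrative):

# Drop events whose body matches the regex (here: empty lines).
server.sources.a1.interceptors = i1
server.sources.a1.interceptors.i1.type = regex_filter
server.sources.a1.interceptors.i1.regex = ^$
server.sources.a1.interceptors.i1.excludeEvents = true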

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 405
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 406
Flume Example 1 (1)

Description
• In this application scenario, Flume collects logs from an application (for example, the online banking system) outside the cluster and saves the logs to HDFS.

Data preparations
• Create a log directory /tmp/log_test on a node in the cluster.
• Use this directory as the monitoring directory.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 407
Flume Example 1 (2)

Download the Flume Client
• Log in to the FusionInsight HD cluster. Choose Service Management > Flume > Download Client.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 408
Flume Example 1 (3)

• Install Flume client:


Decompress the client
tar -xvf FusionInsight_V100R002C60_Flume_Client.tar
tar -xvf FusionInsight_V100R002C60_Flume_ClientConfig.tar
cd FusionInsight_V100R002C60_Flume_ClientConfig/Flume
tar -xvf FusionInsight-Flume-1.6.0.tar.gz

Install the client


./install.sh -d /opt/FlumeClient -f hostIP -c
flume/conf/client.properties.properties

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 409
Flume Example 1 (4)
• Configure flume source

server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 410
Flume Example 1 (5)

• Configure flume channel

# the channel configuration of ch1


server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 411
Flume Example 1 (6)
• Configure flume sink

server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 412
Flume Example 1 (7)

• Name the configuration file of the Flume agent properties.properties.


• Upload the configuration file.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 413
Flume Example 1 (8)

01 Move data files to the monitoring directory /tmp/log_test:

mv /var/log/log.11 /tmp/log_test

02 Check whether the data has been written to HDFS:

hdfs dfs -ls /tmp/flume_avro

03 log.11 has been renamed log.11.COMPLETED, which indicates that the data collection succeeded.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 414
Flume Example 2 (1)

Description
• In this application scenario, Flume collects real-time clickstream logs and saves them to Kafka for real-time analysis.

Data preparations
• Create a log directory /tmp/log_click on a node in the cluster.
• Collect the data into the Kafka topic topic_1028.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 415
Flume Example 2 (2)
• Configure flume source:
server.sources = a1
server.channels = ch1
server.sinks = s1

# the source configuration of a1


server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_click
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.selector.type = replicating
server.sources.a1.basenameHeaderKey = basename
server.sources.a1.deserializer.maxBatchLine = 1
server.sources.a1.deserializer.maxLineLength = 2048
server.sources.a1.channels = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 416
Flume Example 2 (3)

• Configure flume channel:

# the channel configuration of ch1


server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 417
Flume Example 2 (4)

• Configure flume sink:

# the sink configuration of s1


server.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
server.sinks.s1.kafka.topic = topic_1028
server.sinks.s1.flumeBatchSize = 1000
server.sinks.s1.kafka.producer.type = sync
server.sinks.s1.kafka.bootstrap.servers = 192.168.225.15:21007
server.sinks.s1.kafka.security.protocol = SASL_PLAINTEXT
server.sinks.s1.requiredAcks = 0
server.sinks.s1.channel = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 418
Flume Example 2 (5)
• Upload the configuration file to Flume.
• Use Kafka commands to view the data collected into topic_1028, for example with the console consumer shown below.
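A sketch of such a check with the console consumer shipped with the Kafka client; the broker address matches the sink configuration above, while the consumer configuration file path is an assumption:

kafka-console-consumer.sh --topic topic_1028 \
  --bootstrap-server 192.168.225.15:21007 \
  --consumer.config config/consumer.properties --from-beginning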

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 419
Summary
This course describes Flume functions and application scenarios, including the basic concepts, functions,
reliability, and configuration items. Upon completion of this course, you can understand Flume functions,
application scenarios, and configuration methods.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 420
Quiz

• What is Flume? What are the functions of Flume?
• What are the key characteristics of Flume?
• What are the functions of the source, channel, and sink?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 421
Quiz

True or False
• Flume supports cascading. That is, multiple Flume nodes can be cascaded for
data transmission.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 422
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 423
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Kafka

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. Basic concepts and application scenarios of Kafka
B. System architecture of Kafka
C. Key processes of Kafka
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 426
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 427
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 428
Kafka Overview

• Definition of Kafka: Kafka is a high-throughput, distributed, publish-subscribe messaging system. A large messaging system can be built on low-cost servers with Kafka technology.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 429
Kafka Overview
Application scenarios
• Compared with other components, Kafka features message persistence, high throughput, distributed processing, and real-time processing. It applies to online and offline message consumption and massive data collection scenarios, such as website activity tracking, operation-data monitoring with aggregation statistics, and log collection.

Frontend Backend
Producer Producer

Flume Storm

Kafka
Hadoop Spark

Farmer

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 431
Position of Kafka in FusionInsight

(FusionInsight architecture diagram: Kafka is a Hadoop-layer service, alongside M/R, Hive, Spark, Streaming, and Solr, above YARN/ZooKeeper, LibrA, and HDFS/HBase, and below the DataFarm layer of Porter, Miner, Farmer, and Manager.)

Kafka is a distributed messaging system that supports online and offline message
processing and provides Java APIs for other components.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 432
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
 Kafka Write Process
 Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 433
Kafka Topology

(Producer) Front End Front End Front End Service

(Push) (Push) (Push) (Push)

ZooKeeper
Zoo Keeper
(Kafka) Broker Broker Broker Zoo Keeper

(Pull) (Pull) (Pull) (Pull)

Hadoop Real-time Other Data


(Consumer) Cluster Monitoring Service Warehouse

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 434
Kafka Topics
Consumer group 1
Consumer group 2 A consumer uses offsets to record and
read location information.
Kafka cleans up old messages
based on the time and size.

Kafka topic

… new

Older msgs Newer msgs Producer 1


Producer 2

Producer N
Producer appends messages
at the end of a topic.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 435
Kafka Partition
• Each topic contains one or more partitions. Each partition is an ordered and immutable
sequence of messages. Partitions ensure high throughput capabilities of Kafka.

Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12

Partition 1 0 1 2 3 4 5 6 7 8 9 Writes

Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12

Old New

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 436
Kafka Partition
• Consumer group A has two consumers reading data from four partitions.
• Consumer group B has four consumers reading data from four partitions.

Kafka Cluster

Server 1 Server 2

P0 P3 P1 P2

C1 C2 C3 C4 C5 C6

Consumer group A Consumer group B

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 437
Kafka Partition Offset
• The location of a message in a log file is called offset, which is a long integer that uniquely
identifies a message. Consumers use offsets, partitions, and topics to track records.

Consumer
group C1

Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12

Partition 1 0 1 2 3 4 5 6 7 8 9 Writes

Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12

Old New

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 438
Kafka Partition Replica (1)

Kafka Cluster

Broker 1 Broker 2 Broker 3 Broker 4

Partition-0 Partition-1 Partition-2 Partition-3

Partition-3 Partition-0 Partition-1 Partition-2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 439
Kafka Partition Replica (2)

Follower->Leader
Pulls data

ReplicaFetcherThread

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7

writes
old new old new

Leader Partition Follower Partition


ack

Producer

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 440
Kafka Partition Replica (3)
ReplicaFetcherThread

Broker 1 Broker 2 Broker 3


Leader Follower Follower
Partition-0 Partition-0 Partition-0
Leader Follower Follower
Partition-1 Partition-1 Partition-1

… … …

ReplicaFetcherThread ReplicaFetcherThread-1

Broker 1 Broker 2 Broker 3


Leader Leader Follower
Partition-0 Partition-1 Partition-0

… … Follower
Partition-1

ReplicaFetcherThread-2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 441
Kafka Logs (1)
• A large file in a partition is split into multiple small segments. These segments facilitate
periodical clearing or deletion of consumed files to reduce disk usage.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 442
Kafka Logs (2)
segment file 1
msg-00000000000
in-memory index
msg-00000000215
delete msg-00000000000

……
msg-00014517018
msg-00030706778 msg-00014516809

reads

……
append msg-02050706778

segment file N
msg-02050706778
msg-02050706945

……
msg-02614516809

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 443
Kafka Logs (3)

00000000000000368769.log
Message368770 0
00000000000000368769.index Message368771 139
1,0 Message368772 497
3,497 Message368773 830
6,1407 Message368774 1262
8,1686 Message368775 1407
… Message368776 1508
Message368777 1686
N,position

Message368769+N position

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 444
Kafka Log Cleanup (1)
• Log cleanup modes: delete and compact.
• Threshold for deleting logs: retention time limit and size of all logs in a partition.

Parameter / Default Value / Description / Range:

• log.cleanup.policy: default delete. Policy for cleaning up outdated logs. Range: delete or compact.
• log.retention.hours: default 168. Maximum retention time of log files, in hours. Range: 1 ~ 2147483647.
• log.retention.bytes: default -1. Maximum size of log data in a partition; by default the size is not restricted. Unit: byte. Range: -1 ~ 9223372036854775807.
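For example, a broker's server.properties can combine these parameters as follows (the values are illustrative): delete segments older than 7 days, or once a partition's log exceeds about 1 GB, whichever comes first.

log.cleanup.policy=delete
log.retention.hours=168
log.retention.bytes=1073741824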

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 445
Kafka Log Cleanup (2)

Offset 0 1 2 3 4 5 6 7 8 9 10
Log
Key K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 Before
Compaction
Value V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11

Compaction

3 4 6 8 9 10

Keys K1 K3 K4 K5 K2 K6 Log
After
Values V4 V5 V7 V9 V10 V11 Compaction

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 446
Kafka Data Reliability

• All Kafka messages are stored in hard disks and topic


partition replication is performed to ensure data reliability.
• How is data reliability ensured during message delivery?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 447
Message Delivery Semantics

There are three data delivery modes:

At Most Once
• Messages may be lost.
• Messages are never redelivered or reprocessed.

At Least Once
• Messages are never lost.
• Messages may be redelivered and reprocessed.

Exactly Once
• Messages are never lost.
• Messages are processed only once.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 448
Kafka Message Delivery

• Messages are delivered in different modes to ensure reliability in different


application scenarios.

Delivery modes versus replication modes:

• No replicas: synchronous delivery without confirmation: at most once; synchronous delivery with confirmation: at least once; asynchronous delivery without confirmation: at most once; asynchronous delivery with confirmation but no retries: at least once; asynchronous delivery with confirmation and retries: at least once.

• Synchronous replication (leader and followers): synchronous delivery without confirmation: at most once; synchronous delivery with confirmation: at least once; asynchronous delivery without confirmation: at most once; asynchronous delivery with confirmation but no retries: at least once; asynchronous delivery with confirmation and retries: at least once.

• Asynchronous replication (leader only): delivery without confirmation (synchronous or asynchronous): at most once; delivery with confirmation: messages may be lost or repeated.
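In the open-source Java producer, these delivery guarantees map to client settings; the following is a hedged sketch (the property names are standard Kafka producer configurations, while the values are illustrative):

# acks=0: no broker confirmation (at most once).
# acks=1: confirmation from the leader only.
# acks=all: confirmation from the leader and in-sync followers;
# combined with retries > 0 this gives at-least-once delivery.
acks=all
retries=3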

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 449
Kafka Cluster Mirroring

ZooKeeper
ZooKeeper

Kafka Broker ZooKeeper

Producers Kafka Broker


consumer

Source Cluster Data


producer
Data
Mirror Maker

Target Cluster

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 450
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 451
Write Data by Producer

Data Create
Data
Message

Publish
Message
Producer

Message

Kafka Cluster
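A sketch of publishing test messages with the console producer shipped with the Kafka client; the topic name, broker address, and configuration file path are illustrative:

kafka-console-producer.sh --topic test_topic \
  --broker-list 192.168.225.15:21007 \
  --producer.config config/producer.properties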

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 452
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 453
Read Data by Consumer

Overall process:
• A consumer connects to the broker where the leader of the specified topic partition is located, subscribes, and pulls messages from the Kafka logs.

(Diagram: the consumer subscribes to messages from the Kafka cluster and processes the data.)
Kafka Cluster

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 454
Summary
This module describes the following information about Kafka: basic concepts and application
scenarios, system architecture and key processes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 455
Quiz

• Which of the following are features of Kafka?


A. High throughput.
B. Distributed.
C. Data persistence.
D. Random message read.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 456
Quiz

• What is the component that Kafka directly depends on for running?

A. HDFS.
B. ZooKeeper.
C. HBase.
D. Spark.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 457
Quiz

• How is Kafka data reliability ensured?


• What operations on topics can be performed with the shell commands provided by the Kafka client?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 458
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 459
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


ZooKeeper Cluster Distributed
Coordination Service

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. Concepts of ZooKeeper
B. System architecture of ZooKeeper
C. Key features of ZooKeeper
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 462
CONTENTS
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 463
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 464
ZooKeeper Overview

ZooKeeper, a distributed service framework,


provides distributed and highly available service
coordination capabilities and is designed to
resolve data management issues
in distributed applications.

In security mode, ZooKeeper depends on Kerberos and LdapServer; in non-security mode, it does not depend on them. As an underlying component, ZooKeeper is used by upper-layer components such as Kafka, HDFS, HBase, and Storm.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 465
ZooKeeper Overview
Hadoop Ecosystem

(Diagram: ZooKeeper provides the coordination service at the center of the ecosystem, which includes core Hadoop (HDFS, MapReduce), the HBase column datastore, the Hive query language, Pig scripting, Mahout machine learning, Drill interactive analysis, Storm stream processing, and data transfer tools such as Sqoop and Flume.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 466
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 467
Position of ZooKeeper in FusionInsight
(FusionInsight architecture diagram: ZooKeeper sits in the Hadoop layer together with YARN, beneath services such as Hive, M/R, Spark, Streaming, and Flink and beside HDFS/HBase and LibrA, providing coordination for the whole stack.)

Based on the open source Apache ZooKeeper, ZooKeeper provides services for
upper-layer components and is designed to resolve data management issues in
distributed applications.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 468
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 469
ZooKeeper Service Architecture - Model
ZooKeeper Service
Leader

Server Server Server Server Server

Client Client Client Client Client Client Client Client

• A ZooKeeper cluster is a group of servers. In this group, one server functions as the leader and the other servers are followers.
• ZooKeeper selects a server as the leader upon startup.
• ZooKeeper uses a custom protocol named ZooKeeper Atomic Broadcast (ZAB), which ensures data consistency among the nodes in the system.
• After receiving a data change request, the leader first writes the change to the local disk and then to memory, so that the data can be restored.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 470
ZooKeeper Service Architecture - Disaster
Recovery (DR)

ZooKeeper can select a server as the


leader and provide services correctly.
• An instance that wins more than half of the votes during the election
becomes the leader.

For n instances, n could be odd or even.


• If n = 2x + 1, the node that functions as the leader must win x + 1 votes, and the DR capability is x.
• If n = 2x + 2, the node that functions as the leader must win x + 2 votes, and the DR capability is still x.
• For example, a 5-node cluster (x = 2) and a 6-node cluster both tolerate the failure of only 2 nodes, which is why ZooKeeper is typically deployed on an odd number of nodes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 471
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 472
Key Features of ZooKeeper

• Eventual consistency: All servers are displayed in the same view.

• Real-time capability: Clients can obtain server updates and failures within a
specified period of time.
• Reliability: A message will be received by all servers.
• Wait-free: Slow or faulty clients cannot interfere with the requests of fast clients, so that each client's requests are processed effectively.

• Atomicity: Data transfer either succeeds or fails, but no transaction is partial.

• Sequence: The sequence of data status updates on clients is consistent with that of
request sending.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 473
Read Process of ZooKeeper

• ZooKeeper consistency indicates that all servers connected to a client are displayed in the same view.
Therefore, read operations can be performed between the client and any server.

ZK 1 (F) ZK 2 (L) ZK 3 (F)

Local Storage Local Storage Local Storage

1.Read Request 2.Read Response

Client

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 474
Write Process of ZooKeeper

3.2.Send Proposal
2.Write Request
4.1
ZK 1 (F) 3.1 ZK 2 (L) ZK 3 (F)
3.3.Send Proposal
4.2.ACK
4.3.ACK
5.1
5.3.Commit
Local Storage 5.2.Commit Local Storage Local Storage

1.Write Request

Client
6.Write Response

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 475
ACL (Access Control List)

The access control list (ACL) controls access to ZooKeeper. An ACL applies to a specified znode and is not inherited by the znode's subnodes. Run the setAcl /znode scheme:id:perm command to set an ACL.

Scheme indicates the authentication mode. ZooKeeper provides four authentication modes:
• world: a single ID (anyone); any client can access ZooKeeper.
• auth: does not use any ID; only authenticated users can access ZooKeeper.
• digest: uses the MD5 hash value generated from username:password as the authentication ID.
• IP: uses the client host IP address for authentication.

Id: the identity checked during authentication; the form of the authentication information varies with the scheme.

Perm: indicates the permission that a user who passes ACL authentication can have for
ZooKeeper.
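For example, on a zkCli client (the znode path and IP address are illustrative):

setAcl /mynode ip:192.168.1.10:crwda
getAcl /mynode

Here crwda grants the create, read, write, delete, and admin permissions to clients from the given IP address, and getAcl displays the ACL that is now in effect.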

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 476
Log Enhancement

• An ephemeral node exists as long as the session that created it is active. Ephemeral node deletion is recorded in audit logs so that the status of ephemeral nodes can be traced.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 477
Commands for ZooKeeper Clients

• Invoke a ZooKeeper client: zkCli.sh -server 172.16.0.1:24002

• Create a node: create /node

• Obtain the subnodes of a node: ls /node

• Set node data: set /node data

• Obtain node data: get /node

• Delete a node: delete /node

• Delete a node and all subnodes: deleteall /node

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 478
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 479
ZooKeeper and Streaming

ZooKeeper cluster

Streaming cluster

Active Nimbus Standby Nimbus

Supervisor Supervisor Supervisor



Worker Worker Worker Worker Worker Worker

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 480
ZooKeeper and HDFS

(Diagram: Both NameNodes interact with the ZooKeeper cluster. The NameNode that first creates the share directory becomes active, and the standby NameNode monitors the active NameNode's messages in the share directory; ZKFC performs the health monitoring that drives failover.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 481
ZooKeeper and YARN

(Diagram: The ResourceManager that first writes its election message to ZooKeeper becomes active and creates the Statestore directory; the standby ResourceManager monitors the active's election message and reads state from the Statestore directory during failover.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 482
ZooKeeper and HBase

(Diagram: The HMaster that first writes its message to the HMaster directory in ZooKeeper becomes active, and the standby HMaster monitors that directory; each RegionServer writes its own state message to ZooKeeper.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 483
Summary
This module describes the following information about ZooKeeper:
• Functions and position in FusionInsight.
• Service architecture and data models.
• Read and write processes as well as consistency.
• Creation and permission settings of ZooKeeper nodes.
• Relationship with other components.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 484
Quiz

• What are the functions and position of ZooKeeper in a cluster?

• Why does ZooKeeper need to be deployed on an odd number of nodes?

• What does ZooKeeper consistency mean?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 485
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 486
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


FusionInsight HD
Solution Overview

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
After completing this course, you will be able to understand:
A. Huawei big data solution FusionInsight HD
B. The features of FusionInsight HD
C. Success cases of FusionInsight HD
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 489
CONTENTS
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 490
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 491
Apache Hadoop - Prosperous Open - Source Ecosystem
Hadoop Ecosystem Map
(Diagram: around core Hadoop (HDFS, MapReduce) sit high-level interfaces and engines such as Hive, Pig, JAQL, Cascading, and Mahout; collection tools such as Nutch, Flume, and Scribe for unstructured data; Sqoop and hiho for structured data from RDBMS/OLTP systems; HBase as the column datastore; and monitoring and management tools such as Hue, Karmasphere, and Ganglia.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 492
Big Data Is an Important Pillar for Huawei ICT Strategy
(Left panel: Huawei Strategy Map, with the Big Data Analytics Platform at its core, spanning data center infrastructure, core network, IP+optical, enterprise network, FBB/MBB, things (M2M modules) and people (smart devices), with enterprise apps, SDP, BSS/OSS, content and app partners, third-party ISVs, and professional services above. Source: Huawei corporate presentation.)

(Right panel: global distribution of the Huawei big data R&D team.)
• There are 8 research centers with thousands of employees around the world.
• World-class data mining and artificial intelligence experts, such as PMC Committers and IEEE Fellows.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 493
FusionInsight HD: From Open - Source to Enterprise Versions

(Diagram: the evolution from the initial open-source version, through a prosperous community, to the enterprise version. The enterprise work covers version mapping, patch selection, baseline selection, performance optimization, and security configuration for components such as Hadoop, HBase, and logging, and adds security, reliability, and ease of use.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 494
FusionInsight Platform Architecture

(Architecture diagram, top to bottom:
• Industry applications: safe city, power, finance, telecom, and big data cloud services.
• Big data cloud services: data integration services (Data Ingest, DPS); data processing services (MapReduce Service, CloudTable); real-time computing services (Stream, RTD); data analysis services (DWS, MOLAP); machine learning services (MLS, log analysis); Artificial Intelligence Service (AIS: image tagging, NLP).
• DataFarm: FusionInsight Porter for data integration (Sqoop batch collection, Flume real-time collection); FusionInsight Miner for data insight (Weaver graphics analysis engine, Miner Studio mining platform); FusionInsight Farmer for data intelligence (RTD real-time decision engine, Farmer Base reasoning framework).
• FusionInsight HD data processing: Spark one-stop analysis framework, Storm/Flink stream processing framework, FusionInsight Elk standard SQL engine, ZooKeeper collaboration service, Kafka message queue, Yarn resource management, CarbonData new file format, HBase NoSQL database, Oozie job scheduling, and the HDFS distributed file system; FusionInsight LibrA parallel database.
• FusionInsight Manager management platform: security management, performance management, fault management, tenant management, and configuration management.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 495
Contribution to the Open - Source Community

(Capability ladder, bottom to top: be able to use the Apache open-source Hadoop community ecosystem; locate peripheral problems; be able to resolve kernel-level problems (outstanding individuals); be able to resolve kernel-level problems by teams; perform kernel-level development to support key service features; lead the community to complete future-oriented kernel-level feature development; create top community projects and be recognized by the ecosystem.)

Challenges: a large number of components and much code, frequent component updates, and the need for efficient feature integration.

• Outstanding product development and delivery capabilities and carrier-class operation support capabilities empowered by the
Hadoop kernel team.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 496
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 497
System and Data Reliability

System Reliability
• All components without SPOF.
• HA for all management nodes.
• Software and hardware health status monitoring.
• Network plane isolation.

Data Reliability
• Cross-data center DR.
• Third-party backup system integration.
• Key data power-off protection.
• Hot-swappable hard disks.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 498
Security

System Security
• Fully open-source component enhancement.
• Operating system security hardening.

Permission Authentication
• Authentication management of user permissions.
• User permission control for different components.

Data Security
• Data integrity verification.
• File data encryption.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 500
Network Security and Reliability - Dual - Plane
Networking

(Diagram: App-Servers communicate over the cluster service plane; the OMS-Server manages nodes over the cluster management plane; the Web UI client reaches the OMS server over the maintenance network outside the cluster.)

Network Type / Trustworthiness / Description:
• Cluster service plane: High. Carries the Hadoop cluster core components for the storage and transfer of service data.
• Cluster management plane: Medium. Only manages the cluster and carries no service data.
• Maintenance network outside the cluster: Low. Only the web services provided by the OMS server can be accessed.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 501
Visualized Cluster Management, Simplifying O&M

(Screenshot: the cluster management dashboard displays the health status of services and hosts as Good, Bad, or Unknown.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 502
Graphical Health Check Tool (1)

(Pie chart: check item pass rate 72%, check item failure rate 28%.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 503
Graphical Health Check Tool (2)

(Charts: qualification ratio of inspection items 88%, disqualification ratio 12%; node qualification rate 100%, node disqualification rate 0%.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 504
Easy Development
Native APIs of HBase:

try {
  table = new HTable(conf, TABLE);
  // 1. Generate RowKey.
  {......}
  // 2. Create Put instance.
  Put put = new Put(rowKey);
  // 3. Convert columns into qualifiers (need to consider merging cold columns).
  // 3.1 Add hot columns.
  {.......}
  put.add(COLUMN_FAMILY, Bytes.toBytes("QA"), hotCol);
  // 3.2 Merge cold columns.
  {.......}
  // 3.3 Add cold columns.
  put.add(COLUMN_FAMILY, Bytes.toBytes("QB"), coldCols);

Enhanced APIs:

try {
  table = new ClusterTable(conf, CLUSTER_TABLE);
  // 1. Create CTRow instance.
  CTRow row = new CTRow();
  // 2. Add columns.
  {........}
  // 3. Put into HBase.
  table.put(TABLE, row);
} catch (IOException e) {
  // Does not care about connection re-creation.
}

(Enhanced HBase SDK: recoverable connection manager, schema/table design tool, and data tool layered on the HBase API.)

The HBase table design tool, connection pool management function, and enhanced SDK are used to simplify the development of complex HBase data tables.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 505
FusionInsight Spark SQL

SQL compatibility:
• All 99 TPC-DS cases of the standard SQL:2003 are passed.

Data update and deletion:
• Spark SQL supports data insertion, update, and deletion when the CarbonData file format is used.

Large-scale Spark with stable and high performance:
• TPC-DS long-term stability is tested at the 100 TB data-volume scale.

Long-term stability test:
• Memory optimization: resolves memory leakage problems, decentralizes broadcasting, and optimizes Spark heap memory.
• Communication optimization: RPC enhancement, shuffle fetch optimization, and shuffle network configuration.
• Scheduling optimization: GetSplits() and AddPendingTask() acceleration, DAG serialization reuse.
• Extreme pressure test: 24/7 pressure test, HA test.
• O&M enhancement: log security review and DAG UI optimization.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 506
Spark SQL Multi - Tenant

JDBCServer (Proxy) Yarn

YarnQuery Tenant A
Spark JDBC
JDBC Proxy 1 Spark JDBCServer 1
Beeline
Spark JDBC Spark JDBCServer 2
Proxy 2
JDBC YarnQuery Tenant B
Beeline Spark JDBC
Proxy X
Spark JDBCServer 1

...
Spark JDBCServer 2

• The community's Spark JDBCServer supports only single tenants. A tenant is bound to a Yarn resource queue.
• FusionInsight Spark JDBCServer supports multiple tenants, and resources are isolated among
different tenants.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 507
Spark SQL Small File Optimization

1 MB+1 MB … 1 MB+1 MB … RDD2

coalesce coalesce coalesce coalesce coalesce coalesce

1 MB 1 MB 1 MB … 1 MB 1 MB 1 MB … RDD1

1 MB 1 MB 1 MB … 1 MB 1 MB 1 MB … HDFS

Text / Parquet / ORC / Json Table on HDFS

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 508
Apache CarbonData - Converging Data Formats
of Data Warehouse (1)
CarbonData: a single file format that meets the requirements of different access types: OLAP (multidimensional analysis), sequential access (large-scale scanning), and random access (small-range scanning).

(Stack diagram: SQL support through the Hive engine and Spark SQL; distributed execution through MapReduce, Spark, and Flink; storage formats ORC and Parquet (columnar storage) versus the CarbonData file (fully indexed, hybrid storage).)

Speedups measured with CarbonData:
• Random access (small-range scanning): 7.9 to 688 times.
• OLAP / interactive query: 20 to 33 times.
• Sequential access (large-scale scanning): 1.4 to 6 times.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 509
Apache CarbonData - Converging Data Formats
of Data Warehouse (2)
• Apache Incubator project since June 2016.
• Apache releases: 4 stable releases; the latest, 1.0.0, on Jan 28, 2017.
• Contributors: Alibaba Group, eBay, Huawei, Hulu, InMobi, Intel, LeTV, Meituan Waimai, Talend, Bank of Communications.
• In production: Bank of Communications, Huawei, Hulu.

(Stack diagram: compute engines Apache Spark, Flink, and Hive run on top of the CarbonData storage format, which sits on Hadoop.)

CarbonData supports IUD statements and provides data update and deletion capabilities in
big data scenarios. Pre-generated dictionaries and batch sort improve CarbonData import
efficiency while global sort improves query efficiency and concurrency.
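A sketch of the IUD capability in Spark SQL, assuming a CarbonData table named sales with the referenced columns already exists (the table and column names are illustrative):

-- Illustrative statements only.
INSERT INTO sales SELECT * FROM sales_staging;
UPDATE sales SET (price) = (price * 0.9) WHERE country = 'US';
DELETE FROM sales WHERE order_date < '2016-01-01';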

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 510
CarbonData Enhancement

(Diagram: GUIs and other analysis tools connect through the Thrift Server to Spark SQL on Spark, which reads the CARBON file format as well as other data sources.)

• Quick query response: CarbonData features high-performance queries; its query speed is ten times that of Spark SQL. Its dedicated data format is designed around high-performance queries, including multiple index technologies, global dictionary encoding, and multiple push-down optimizations, so it responds quickly to TB-level data queries.
• Efficient data compression: CarbonData compresses data by combining lightweight and heavyweight compression algorithms, saving 60% to 80% of storage space and significantly reducing hardware storage costs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 511
Flink - Distributed Real - Time
Processing System

Flink is a distributed real-time processing system with low latency (measured in milliseconds), high throughput, and high reliability, promoted by Huawei in the IT field and integrated into FusionInsight HD.

Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink specializes in stream processing and is one of the top open-source stream processing engines in the industry. It is suitable for low-latency data processing scenarios, providing highly concurrent pipelined data processing, millisecond-level latency, and high reliability.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 512
Visible HBase Modeling

(Diagram: a user list is mapped to an HBase table. A column family is a collection of columns that have service association relationships. Each column of the service data represents an attribute and is mapped to an HBase column qualifier, where each column value is stored as a KeyValue. The row key can be composed from columns, for example Reverse(Column1, 4) + Column2 + Column3.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 513
HBase Cold Field Merging Transparent to
Applications
User Data

ID Name Phone ColA ColB ColC ColD ColE ColF ColG ColH

A B C D
HBase KeyValues

Problems
• High expansion rate and poor data query performance due to the HBase column increase.
• Increased development complexity and metadata maintenance due to the application layer
merging cold data columns.

Features
• Cold field merging transparent to applications.
• Real-time write and batch import interfaces.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 514
Hive / HBase Fine - Grained Encryption

Application scenarios
• Data saved in plaintext mode may cause security risks of sensitive-data leakage.

Solution
• Hive: encryption of tables and columns.
• HBase: encryption of tables, column families, and columns.
• Encryption algorithms AES and SM4, plus user-defined encryption algorithms.

(Diagram: sensitive data written to Hive/HBase is encrypted before being stored in HDFS as ciphertext and decrypted on read; insensitive data is stored as-is.)

Customer benefits
• Sensitive data is encrypted and stored by table or column.
• Algorithm diversity and system security.
• Encryption and decryption are transparent to services.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 515
HBase Secondary Indexing

(Diagram: without an index, a query on UserTable is a "Scan+Filter" over a large scanning area; with a secondary index, the index table UserTable_idx, whose row keys are built from the indexed column values, locates the destination row in the data table.)

• No index: "Scan+Filter" scans a large amount of data.
• Secondary index: the target data can be located after two I/Os.

• Index Region and Data Region as companions under a unified processing mechanism.
• Original HBase API interfaces, user-friendly.
• Coprocessor-based plug-ins, easy to upgrade.
• Write optimization, supporting real-time write.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 516
CTBase Simplifies HBase Multi - Table Service Development

(Example: a CTBase clustered table stores an AccountInfo record (account ID, account name, account balance, for example A0001 / Andy / $100232) together with its Transaction records (amount and time, for example $100 at 12/12/2014 18:00:02), so that an account and all of its transactions live in one HBase table.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 517
HFS Small File Storage and Retrieval Engine

Application scenario
• A large number of small files and the associated description information need to be stored.

Current problem
• A large number of small files stored in the Hadoop Distributed File System (HDFS) put great pressure on the NameNode, and storing many small files in HBase wastes I/O resources during compaction.

HFS solution value
• The HBase FileStream (HFS) stores not only small files but also the metadata describing those files.
• The HFS provides a unified and friendly access API.
• The HFS selects the optimal storage option based on file size: small files are stored in HBase Medium-sized Object (MOB) HFiles, and medium and large files are stored directly in HDFS.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 518
Label - based Storage
I/O conflicts affect online services. With label-based storage, the data of online applications is stored only on nodes labeled with "Online Application" and is isolated from the data of offline applications. This design prevents I/O competition and improves the local hit ratio.

(Diagram: with HDFS common storage, online applications and batch applications mix their data across all nodes; with HDFS label-based storage, online-application data is stored only on the nodes labeled for online applications, and batch data on the remaining nodes.)

• Solution description: Label cluster nodes based on applications or physical characteristics, for example, label a
node with “Online Application.” Then application data is stored only on nodes with specified labels.
• Application scenarios:
1. Online and offline applications share a cluster.
2. Specific services (such as online applications) run on specific nodes.
• Customer benefits:
1. I/Os of different applications are isolated to ensure the application SLA.
2. The system performance is improved by improving the hit ratio of application data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 519
Label - based Scheduling

(Diagram: with common scheduling, Spark and MapReduce applications run on arbitrary nodes; with label-based scheduling, each application runs only on nodes carrying its label, for example Spark applications on "large memory" nodes and MapReduce applications on "default" nodes.)

Fine-grained scheduling based on application awareness, improving resource utilization


• Different applications such as online and batch processing are running on nodes with their specific labels to
absolutely isolate computing resources of different applications and improve service SLA.
• Applications that have special requirements on node hardware are running only on nodes with special
hardware, for example, Spark applications need to run on nodes with large memory. Resources are scheduled
on demand, improving resource utilization and system performance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 520
CPU Resource Configuration Period Adjustment
Batch processing application Real-time application Batch processing application Real-time application

Hive / Spark / … Hive / Spark / …


HBase HBase
QA QB QC QD QA QB QC QD

CPU CPU

Cgroup1 40% Cgroup2 60% Cgroup1 80% Cgroup2 20%

7:00 Time
20:00

• Solution description: Different services are given different proportions of resources in different time segments. For example, from 7:00 to 20:00, when real-time services are at peak hours, they can be allocated 60% of the resources; from 20:00 to 7:00, when real-time services are at off-peak hours, 80% of the resources can be allocated to the batch processing applications.
• Application scenario: The peak hours and off-peak hours of different services are different.
• Customer benefit: Services can obtain as many resources as possible at peak hours, boosting the average resource
utilization of the system.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 521
Resource Distribution Monitoring
(Chart: remaining HDFS capacity of the cluster over time, in GB.)

Benefits
• Quick focusing on the most critical resource consumption.
• Quick locating of the node with the highest resource consumption to take appropriate measures.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 522
Dynamic Adjustment of the Log Level

• Application scenario: When a fault occurs in the Hadoop cluster, locating it quickly usually requires raising the log level. Traditionally, a log level change only takes effect after the process is restarted, which interrupts services. How can this problem be resolved?
• Solution: Dynamically adjust the log level on the Web UI.
• Benefits: When locating a fault, you can quickly change the log level of a specified service or node without restarting the process or interrupting services.
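Stock Hadoop daemons already expose a /logLevel HTTP endpoint (the same mechanism behind the `hadoop daemonlog` command), and the FusionInsight Web UI builds on this kind of capability. A hedged sketch against that standard servlet; the host, port, and logger name are examples:

# Change a logger's level on a running Hadoop daemon without restart,
# using the standard /logLevel servlet.
import requests

DAEMON = "http://datanode-1:9864"   # default DataNode HTTP port in Hadoop 3.x
LOGGER = "org.apache.hadoop.hdfs.server.datanode.DataNode"

def set_log_level(level):
    r = requests.get(f"{DAEMON}/logLevel", params={"log": LOGGER, "level": level})
    r.raise_for_status()

set_log_level("DEBUG")   # raise verbosity while locating the fault
# ... reproduce and capture the problem ...
set_log_level("INFO")    # restore the normal level afterwards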

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 523
Wizard-based Cluster Data Backup

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 524
Wizard-based Cluster Data Restoration

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 525
Multi-Tenant Management

Multi-level tenant management (a capacity-invariant sketch follows below):
• Organization mapping: Company → enterprise tenant; Dept. A → Tenant A; Sub-department A_1 → Tenant A_1.
• Computing resources: Yarn queues (CPU / memory / I/O).
• Storage resources: HDFS (storage space / file overview).
• Service resources: HBase, ...

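Each tenant level ultimately maps to a Yarn queue and an HDFS quota. As a sketch of the invariant that multi-level tenancy must maintain, the hypothetical check below verifies that sub-tenants never claim more than 100% of their parent's capacity:

# Hypothetical tenant tree: each tenant maps to a Yarn queue whose
# capacity is a percentage of its parent's resources.
TENANTS = {
    "Company":   {"capacity": 100, "children": ["TenantA"]},
    "TenantA":   {"capacity": 60,  "children": ["TenantA_1"]},
    "TenantA_1": {"capacity": 50,  "children": []},
}

def check(tenant):
    """Children of a tenant must not request more than 100% of it."""
    kids = TENANTS[tenant]["children"]
    used = sum(TENANTS[k]["capacity"] for k in kids)
    assert used <= 100, f"{tenant}: children claim {used}%"
    for k in kids:
        check(k)

check("Company")  # passes: TenantA takes 60% of Company, TenantA_1 takes 50% of TenantA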
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 526
One-Stop Tenant Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 527
Visualized, Centralized User Rights Management
Visualized, centralized user rights management is easy to use, flexible, and refined (a schematic model follows below):
• Easy to use: visualized, unified user rights management across multiple components.
• Flexible: role-based access control (RBAC) with predefined privilege sets (roles) that can be reused.
• Refined: multi-level (database / table / column-level) and fine-grained (Select / Delete / Update / Insert / Grant) authorization.
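The model reduces to user → roles → privileges scoped to databases, tables, and columns. A schematic Python model; all names and privilege sets here are hypothetical:

# Minimal RBAC model: roles bundle fine-grained privileges on objects.
ROLES = {
    "analyst": {("db1.sales", "Select"), ("db1.sales.amount", "Select")},
    "etl":     {("db1.sales", "Insert"), ("db1.sales", "Delete")},
}
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}

def allowed(user, obj, action):
    return any((obj, action) in ROLES[r] for r in USER_ROLES.get(user, ()))

print(allowed("alice", "db1.sales", "Select"))  # True
print(allowed("alice", "db1.sales", "Insert"))  # False: not granted to her roles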

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 528
Automatic NTP Configuration

[Figure: the active management node runs an NTP server that synchronizes, as a client, with an external NTP server; the standby management node, the control node, and all DataNodes run NTP clients that synchronize with the active management node.]

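What the installer automates amounts to writing a role-appropriate NTP configuration on every node. A minimal sketch of that logic; the addresses are placeholders, and `server ... iburst` is standard ntp.conf syntax:

# Emit the NTP directive each node needs, mirroring the hierarchy above.
EXTERNAL_NTP = "ntp.example.com"   # placeholder external time source
ACTIVE_MGMT = "192.0.2.10"         # placeholder active management node

def ntp_conf(role):
    if role == "active_mgmt":      # syncs with the external source
        return f"server {EXTERNAL_NTP} iburst\n"
    # standby management, control, and data nodes all follow the
    # active management node
    return f"server {ACTIVE_MGMT} iburst\n"

for role in ("active_mgmt", "standby_mgmt", "control", "datanode"):
    print(role, "->", ntp_conf(role).strip())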
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 529
Automatically Configuring Mapping of Hosts

Benefits
• Shortens the environment preparation time for installing the Hadoop cluster.
• Reduces the probability of user configuration errors (see the sketch below).
• Reduces the risk that manually configuring host mappings disturbs stably running nodes after capacity expansion in a large-scale cluster.
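Automating the mapping is essentially generating one consistent set of /etc/hosts entries from the node inventory, validating it, and distributing it to every host. A minimal sketch; the inventory is hypothetical:

# Build /etc/hosts entries from a node inventory (hypothetical data);
# the installer distributes the same block to every cluster node.
INVENTORY = {
    "192.0.2.11": "master-1",
    "192.0.2.12": "master-2",
    "192.0.2.21": "worker-1",
}

# Hostnames must be unique, or services resolve the wrong peer.
assert len(set(INVENTORY.values())) == len(INVENTORY), "duplicate hostname"

hosts_block = "\n".join(f"{ip}\t{name}" for ip, name in INVENTORY.items())
print(hosts_block)   # the same block is appended to /etc/hosts on each node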

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 530
Rolling Restart / Upgrade / Patch

HDFS rolling upgrade example: modifying a configuration, performing the upgrade, and installing a patch can all be carried out in rolling mode. Service interruption duration of core components: none, even over an upgrade window of about 12 hours.

[Figure: an HDFS cluster with two NameNodes (HA) and five DataNodes is upgraded node by node from C60 to C70 while clients keep working. Components that support rolling operations include ZooKeeper, HDFS, Yarn, HBase, Storm, Flume, Loader, Spark, Hive, and Solr.]

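The rolling pattern behind all three operations is the same: take one worker at a time, act on it, and wait until the cluster reports healthy before moving on, so replicas on the remaining DataNodes keep serving I/O. A skeleton with hypothetical helper functions; real orchestration goes through FusionInsight Manager:

# Rolling restart skeleton: one DataNode at a time, gated on health.
import time

DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # example hosts

def restart(node):        # hypothetical: would call the manager's API
    print(f"restarting {node}")

def cluster_healthy():    # hypothetical: e.g. check for missing blocks
    return True

for node in DATANODES:
    restart(node)
    while not cluster_healthy():   # do not touch the next node until
        time.sleep(10)             # replication has caught up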
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 531
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 532
Huawei Smart Transportation Solution

Secure:
• Challenges to key vehicle identification: insufficient automatic identification capability for key vehicles.
• Insufficient traffic accident detection capability: blind spots, weak detection technology, and manual accident reporting and handling.
• Low efficiency of special crackdowns: fragmented information and a weak crackdown platform.

Organized:
• Challenges to checkpoint and e-police capabilities: rigid algorithms.
• Challenges to violation review and handling capabilities: heavy workload.
• Challenges to crackdown data analysis capabilities: manual analysis taking 7-30 days.

Smooth:
• Challenges to traffic detection capability: faulty detection devices, low detection efficiency, and unreliable detection results.
• Challenges to traffic analysis capabilities: traffic information is not shared among cities.
• Challenges to traffic signal optimization.

Intelligent:
• Computing intelligence challenges: closed systems and technologies, and fragmented information.
• Perceptual intelligence challenges: weak perception of traffic, events, and violations.
• Cognitive intelligence challenges: lack of traffic awareness at regions and intersections.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 533
Traffic Awareness in the Whole City: Deep Learning and
Digital Transformation
• No cameras are added: through deep learning and intelligent analysis, about 50 billion real-time road traffic parameters are generated every month, laying the foundation for the digital transformation of traffic management.

[Figure: applications (vehicle traffic and event awareness, traffic flow analysis, traffic accident perception and analysis, traffic signal optimization) sit on a deep learning platform (algorithm warehouse with deep learning training, reasoning, and search engines) and a video cloud storage and cloud computing platform (traffic big data crackdown modeling engine, time and space analysis engine), fed by more than 3000 channels of HD e-police, monitoring of more than 6000 roads, and more than 4000 traffic checkpoints.]
Note: The preceding figures use one city as an example.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 534
Traffic Big Data Analysis Platform

[Figure: four analysis applications around a national integrated transportation command platform:
• Key vehicle traffic analysis: 400 million vehicles + 12.6 billion pass records.
• Key vehicle violation analysis: 400 million vehicles + 2.6 billion violation records.
• Detection replacement analysis: 400 million vehicles + 2.6 billion violation records + 1.1 billion detection records, completed in 20 minutes.
• Buying and selling analysis: 400 million vehicles + 2.6 billion violation records + 110 million drivers who cleared license points.]

Serving 400 million vehicles across provinces and cities in China, the traffic big data analysis platform analyzes 2.6 billion violation records and 12.6 billion pass records, greatly improving the security and order management capability of cross-province traffic and reaching a world-leading level.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 535
Limitations of Traditional Marketing Systems

• Low accuracy: customer groups are obtained through data collection and filtering, which is time-consuming and labor-intensive, so precise sales cannot be implemented.
• Non-real-time: advertisements can be pushed only according to preset rules; real-time marketing triggered by events or locations cannot be implemented.
• Limited data support: mainly structured data, unable to handle semi-structured data; customer behavior involved in rule operation and configuration has a low support rate.
• Fixed rules: marketing strategies and rules are fixed; new rules need to be developed and implemented before use.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 536
Marketing System Architecture

[Figure: marketing system architecture:
• Application layer: marketing execution, marketing analysis, statistical analysis, scheduling monitoring, marketing plan, ...
• Model layer: marketing model, event detection model, rule engine, recommendation engine.
• Middleware layer: Chinasoft big data middleware (Ark).
• Platform layer: Huawei enterprise-class big data platform (FusionInsight) with a real-time stream processing component (Flume, Storm / Flink, Kafka, Redis, ZooKeeper), an offline processing component (Spark, Loader, Hive, HBase, MapReduce, HDFS / Yarn, ZooKeeper, MPPDB), FusionInsight Farmer RTD (Farmer, RTD, MQ, Redis), and Manager.
• Infrastructure / cloud platform: x86 servers, network devices, security devices.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 537
Big Data Analysis, Mining, and Machine Learning Make
Marketing More Accurate

[Figure: closed-loop marketing flow: data analysis → predictive modeling → model application → model effect monitoring and evaluation. Data sources and customer data feed customer group filtering and correlation analysis; a marketing activity plan is executed over multiple channels (SMS, app, Twitter); analysis reports support effect evaluation and continuous optimization. Model effect evaluation, customer data updates, and model improvement close the loop.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 538
Solution Benefits
• Precise: precise customer group mining — customer-based 360-degree view; customer type-based mining.
• Easy to use: self-learning of rules — customizable variables, rules, and rule modes; automatic rule learning and optimization.
• Comprehensive: support for various types of data — structured, unstructured, and semi-structured data; multi-channel comprehensive analysis; statistical analysis.
• Reliable: uninterrupted services — always-on service.
• Real-time: real-time marketing information push — event-based; location-based; millisecond-level analysis based on full data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 539
A Carrier: Big Data Convergence to Achieve Big Values

[Figure: upper-layer applications (crowd gathering, credit investigation, service experience quality computing, Internet access log query, signaling log query, domain name query log query, ...) run on a real-time query platform (HBase with a KV interface) and a basic analysis platform (Hive, MapReduce, Spark with SQL and Spark SQL interfaces), both built on a shared Hadoop resource pool (HDFS, Yarn / ZooKeeper, Manager). ETL loads the platform from traditional data sources (BOM) and new data sources (Internet).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 540
Philippine PLDT: Converting and Archiving Massive
CDRs
[Figure: reports, interactive analysis, forecast analysis, and text mining (CSP) run over a data federation layer on top of a DWH (aggregation) and Hadoop (archiving, CSSD). Source files are periodically obtained from a transit server, converted to the T0 / T1 format, and uploaded to the CSSD / DWH servers. Data sources include structured data (SUN, NSN E///, PLP ODS, AURA, ...) and unstructured data (mobile Internet, social media, voice to text).]

Hadoop stores the original CDRs together with structured and unstructured data, improving storage capacity and processing performance while reducing hardware costs.
A total of 1.1 billion records (664,300 MB) were extracted, converted, and loaded at an overall speed of 113 MB/s, far above the 11 MB/s the customer expected; at that rate the full data set loads in roughly 100 minutes instead of more than 16 hours.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 541
Summary
These slides describe Huawei's enterprise-class big data platform FusionInsight HD, focusing on its features and application scenarios, and present FusionInsight HD success cases in the industry.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 542
Quiz

• What are the features of FusionInsight HD?


• Which encryption algorithms are supported by Hive / HBase fine-grained encryption?
• A large number of small files stored in Hadoop HDFS puts great pressure on the NameNode, and storing many small files in HBase wastes I/O resources during compaction. What technical solutions address this problem?
• Which log levels can be dynamically adjusted?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 543
Quiz

True or False
• Hive supports encryption of tables and columns. HBase supports encryption of tables,
column families, and columns. (T or F).
• User rights management is role-based access control and provides visualized and unified
user rights management for multiple components. (T or F).

Multiple-Answer Question
• Which of the following indicate the high reliability of FusionInsight HD? ( )
A. All components are free of SPOFs.
B. All management nodes support HA.
C. Health status monitoring for the software and hardware.
D. Network plane isolation.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 544
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 545
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.
