HCIA-Big Data V2.0 Training Material


HCIA-Big Data V2.0
Training Material

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


• Chapter 1 Big Data Industry and Technological Trends∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙4
• Chapter 2 HDFS - Hadoop Distributed File System∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙ 55
• Chapter 3 MapReduce - Distributed Offline Batch Processing and YARN - Resource Negotiator∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙88
• Chapter 4 Spark2x - In-memory Distributed Computing Engine∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙133
• Chapter 5 HBase - Distributed NoSQL Database∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙180
• Chapter 6 Hive - Distributed Data Warehouse∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙236
• Chapter 7 Streaming - Distributed Stream Computing Engine∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙276
• Chapter 8 Flink - Stream Processing and Batch Processing Platform∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙312
• Chapter 9 Loader - Data Transformation∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙352
• Chapter 10 Flume - Massive Logs Aggregation∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙382
• Chapter 11 Kafka - Distributed Message Subscription System∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙425
• Chapter 12 Zookeeper - Cluster Distributed Coordination Service∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙461
• Chapter 13 FusionInsight HD Solution Overview∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙∙488

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 1
Big Data Industry and Technological Trends

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
After completing this course, you will be able to understand:
A. What big data is
B. Big data technological trends and applications
C. Huawei big data solution
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 3
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 4
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 5
Big Data as a National Strategy Around the World

• G8: The Group of Eight (G8) released the G8 Open Data Charter, proposing to accelerate the implementation of data openness and usability.
• EU: The European Union (EU) promotes the Data Value Chain to transform the traditional governance model, reduce the costs of common departments, and accelerate economic and employment growth with big data.
• Japan: The Abe Cabinet announced the Declaration to Be the World's Most Advanced IT Nation, which plans to develop Japan's national IT strategy with open public data and big data as its core.
• UK: The UK Government released the Capacity Development Strategy, which aims to use data to generate business value and boost economic growth, and undertakes to open the core databases in the transportation, weather, and medical treatment fields.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 6
Implementing the National Big Data Strategy

Implementing the national big data strategy to accelerate the construction of a
"Digital China" involves five tasks, which are summarized as follows:

• Promote the innovation and development of big data technology.

• ​Build a digital economy with data as a key enabler.

• Improve the country's capability in governance with big data.

• Improve people's livelihood by applying big data.

• Protect national data security.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 7
Big Data Era

Definition from Wikipedia:

• Big data is data sets that are so voluminous and complex that traditional data-processing application
software is inadequate to deal with them.

The 4 V's:
01 Huge amount of data (Volume)
02 Various types of data (Variety)
03 High data processing speed (Velocity)
04 Low data value density (Value)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 8
Source of Big Data

Social data:
• There are more than 200 million messages every day.
• There are more than 300 million active users every day.
• Facebook: 50 TB of log data is generated each day, with over 100 TB of analysis data derived.

Machine data:
• There are 2.8 billion smartphone users worldwide.
• Hundreds of millions of devices that support the Global Positioning System (GPS) are sold each year.
• CERN: Experiments at CERN generate an entire petabyte (PB) of data every second as particles are fired around the Large Hadron Collider (LHC).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 9
All Businesses Are Data Businesses

• Your business is a data business now.
• Data about your customers is as valuable as your customers.
• Keep data moving.
• Streaming data is business opportunity.
• Data as a Platform (DaaP).
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 11
Differences Between Data Processing in the Big Data Era and
Traditional Data Processing

From databases (DBs) to big data (BD):

• "Pond fishing" vs. "sea fishing": the "fishes" represent the data to be processed.

Database vs. Big Data:
• Data scale: Small (in MB) vs. large (in GB, TB, or PB).
• Data type: Single (mostly structured) vs. diversified (structured, semi-structured, or unstructured).
• Relationship between modes and data: Modes come ahead of data (ponds come ahead of fishes) vs. data comes ahead of modes, with modes evolving constantly as data increases.
• Object to be processed: Data (fishes in ponds) vs. data relationships (using certain fishes to determine whether other types of fishes exist).
• Processing tool: One size fits all vs. no size fits all.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 12
Big Data Era

China's netizens rank first in the world, and the data volume they generate each day also surpasses
that of any other country.

• Taobao website: More than 50,000 GB of data is generated per day; the data storage volume is 40 million GB.
• A camera working at a rate of 8 Mbit/s: 3.6 GB of data is generated per hour; tens of millions of GB of data can be generated each month in one city.
• Baidu: 1 billion GB of data in total; 1 trillion web pages stored; about 6 billion search requests processed each day.
• Hospitals: The CT image data generated for one patient reaches dozens of GB; tens of billions of GB of data needs to be stored each year in a country.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 14
Big Data Era

• Decrease of hardware costs.

• Acceleration of network bandwidth.

• Emergence of cloud computing.

• Popularization of intelligent terminals.

• E-commerce and social networks.

• Comprehensive application of electronic maps.

• Internet of Things (IoT).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 15
Relationship Between Big Data and People

If all you have is a hammer, everything looks like a nail.

Today, big data can seem miraculous and omnipotent. However, do not
take big data as an all-round way to solve every problem in the world.
Human thought, personal culture and behavior, and the existence and
development of nations and societies are complicated, intricate, and
unique. Computers cannot simply let the numbers speak for themselves;
no matter when, it is people who are speaking and thinking.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 16
What Big Data Cannot Do

01 Substitute managers' decision-making capabilities
• Big data is not only a technical issue, but also a decision-making issue.
• Data onboarding must be pushed and determined by top leaders.

02 Substitute effective business models
• Balance cannot always be obtained with big data.
• Business models are paramount. We must figure out how to obtain profits in advance.

03 Discover knowledge aimlessly
• Data mining must be carried out with restrictions and targets. Otherwise, it is futile.

04 Substitute the role of experts
• Experts contribute greatly to product modeling, as in IBM Deep Blue and AlphaGo.
• The role of experts may decrease over time. However, experts play a major role in the initial stage.

05 Build a model for permanent benefits
• Big data requires "live" data (with feedback).
• Models need to be updated through lifelong machine learning.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 17
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 18
Big Data Era Leading the Future

Data has penetrated into every industry and business domain.

• Discerning essences (services), forecasting trends, and guiding the future are at the core of the big data era.
• Guide today's efforts with a clear future target, and make due efforts now to secure future success.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 19
Big Data Application Scope

Proportion of Top 100 industries using big data

[Pie chart: finance, city, medical treatment, sports, education, telecom, retail, and others; individual shares range from 4% to 24%.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 20
Big Data Application: Politics

Big data psychological analysis helped Trump win the U.S. presidential election.

• Donald Trump employed Cambridge Analytica Ltd (CA) to perform personality and requirement analysis on American voters, acquiring the personality profiles of 220 million Americans. (Figure: "the cave", CA's data analysis center.)
• CA used voters' "likes" on Facebook to analyze their personality traits and political orientation, classified voters into three types (Republican supporters, Democratic supporters, and swing voters), and focused on attracting swing voters.
• Trump had never sent emails before; he bought his first smartphone after the presidential election and became fascinated with Twitter. The messages he sent on Twitter were data-driven and varied for different voters.
• African American voters were shown a video in which Hillary Clinton referred to black people as "predators", steering them away from her ballot box. These "dark posts" were visible only to specified users.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 21
Big Data Application: Finance

Traditional customers:
• Obtain services at fixed times and places.
• Passively receive data.
• Trust market information.
• Passively receive information propagation.

Omni-channel customers:
• Obtain services anytime and anywhere.
• Analyze and create data.
• Seek meaningful experience and review details.
• Involve themselves in creating content, products, and experience.

Traditional finance:
• Offers standard industrial services.
• Focuses on processes and procedures.
• Passively receives information from a single source.
• Contacts customers through customer managers; interacts in fixed channels and in inflexible ways.

New financial institutions:
• Attach importance to new data mining.
• Operate customers and focus on scenarios.
• Improve merchandise efficiency.
• Serve customers with flexible, personalized services.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 22
Big Data Application: Finance Case

• There is a three-hour time difference between the east and west coasts of the USA.
• Walmart uses the sales analysis results of the east coast to guide the goods arrangement of the west coast on the same day.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 23
Big Data Application: Education
Now, big data analysis has been applied to American public education and has become an important
force of education reform.

Big data in school education correlates in-class behavior with outcomes.

Behavior analyzed:
• Average time for answering each question
• Sequence of question-answering in exams
• Duration and frequency of interaction with teachers
• Duration and correctness of answering questions
• Question-answering times
• Hand-raising times in class
• Homework correctness

Outcomes tracked:
• Academic performance
• Enrollment rate
• Dropout rate
• Rate of admission into higher schools
• Literacy accuracy

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 24
Big Data Application: Transportation
Most people may choose railway for a distance less than 500 km, but...

• Example route for a distance less than 500 km: Beijing to Taiyuan. (The map also marks 500 km radii around Beijing, Shanghai, Chengdu, and Guangzhou.)
• Modes of transportation during the 2018 Chinese Spring Festival, compared by price and time: plane, train, vehicle rental, and long-distance coach.
• The train has the highest performance-price ratio but tickets are hard to obtain; the long-distance coach is more cost-effective than vehicle rental, whose performance-price ratio is inferior to taking the train.
• For a 500 km (about six-hour) driving distance, railway has the highest performance-price ratio, but the chance of buying tickets depends on luck, and the performance-price ratio of vehicle rental is inferior to taking the train. According to a survey, when they fail to get train tickets, more than 70% of people will rent a vehicle to go home.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 25
Big Data Application: Tourism

Island Travel Preference During China's National Day Holiday

[Pie chart: Phuket 29%, Koh Samui 18%, Bali 12%, Kuala Lumpur 11%, Okinawa 9%, Manila 5%, Colombo 5%, Jakarta 5%, Jeju 3%, Honolulu 3%.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 26
Big Data Application: Tourism

[Bar chart: air ticket price forecast (0-6000) for Honolulu, Colombo, Bali, Okinawa, Jeju, Phuket, Jakarta, Manila, Koh Samui, and Kuala Lumpur.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 27
Big Data Application: Government and Public
Security
Public security scenario: automatic warning and response.

• Automatic warning system: area-based people flow thresholds are monitored (for example, more than 10,000 people at city level, or more than 2,000 people for a community area). Sample alarm: the number of people on the right side of Beijing Olympic Forest Park exceeds the threshold as a crowd gathers to watch an affray.
• The city or community monitoring system delivers the issue to transaction processing departments for confirmation and reports it to upper-level departments; the supervision department locates the issue in real time at its initial stage.
• Big data analysis can monitor and analyze population flow into cities, warning of abnormal increases in people flow.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 28
Big Data Application: Traffic Planning
Traffic planning scenario: multi-dimensional analysis of traffic crowds.

Areas where people flow once exceeded the threshold:
• North gate of Beijing Workers' Gymnasium: more than 500 people per hour.
• Sanlitun: more than 800 people per hour.
• Beijing Workers' Gymnasium: more than 1,500 people per hour.

[Charts: crowd analysis by age proportion (younger than 20, 20-30, 30-40, older than 50) and by travel mode (bus, metro, auto, others), supporting crowd-based traffic forecasts, road network planning, and bus line planning.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 29
Big Data Applications: Sports

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 30
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 31
Challenges of Traditional Data Processing Technologies

• Scalability: there is a gap between the scalability required for big data processing and hardware performance; traditional systems scale up, while big data workloads demand scale-out.

[Figure: a traditional framework of midrange computers (e.g., P595, P570, and P690) + disk arrays + commercial data warehouses (DB2, Oracle, Sybase, TD), with middleware (WebLogic 8.2 and Apache Tomcat 5.2, on Appframe/Spring) and portal, report, KPI, OLAP analysis, data mining, and data management applications on top.]

Its limitations:
• High cost for storing massive data.
• Insufficient batch data processing performance.
• Lack of streaming data processing.
• Limited scalability.
• Single data source; external value-added data assets remain untapped.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 32
Application Scenarios of the Enterprise
Big Data Platform

• Operation (telecom and finance; structured data): operation analysis, telecom signaling, financial subledger, financial bills, electricity distribution, smart grid.
• Management (finance; structured + semi-structured data): performance management, report analysis, history analysis, social security analysis, tax analysis, decision-making support and prediction.
• Supervision (government; structured + semi-structured data): public security network monitoring, technical investigation for national security, public opinion monitoring, China Banking Regulatory Commission (CBRC) inspection, food source tracing, environment monitoring.
• Profession (government; non-structured data): audio and video, seismic prospecting, weather nephogram, satellite remote sensing, radar data, IoT.

• With strong appeals for data analysis in telecom carriers, financial institutions, and governments, new
technologies have been adopted on the Internet to process big data of low value density.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 33
Challenges Faced by Enterprises (1)

Challenge 1: Business departments do not have clear


requirements on big data.

• Many enterprises' business departments are not familiar with big data, its application scenarios, or its benefits, so it is
difficult for them to provide accurate big data requirements. Because requirements are unclear and big data departments are
non-profit departments, enterprise decision-makers worry about a low input-output ratio and hesitate to build a big data
department; some even delete large amounts of valuable historical data because no application scenario exists for it.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 34
Challenges Faced by Enterprises (2)

Challenge 2: Serious data silo problems within


enterprises.
• The most important challenge enterprises face in implementing big data is data fragmentation. In large-scale enterprises,
different types of data are often scattered across different departments, so the same data cannot be shared within the
enterprise and the value of big data cannot be fully exploited.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 35
Challenges Faced by Enterprises (3)

Challenge 3: Low data availability and poor quality.

• Many large and medium-sized enterprises generate a large amount of data each day. However, some
enterprises pay no attention to big data preprocessing, resulting in nonstandard data processing.
During big data preprocessing, data needs to be extracted and converted into a form that is easy to
process, then cleaned and denoised to obtain valid data. According to data from Sybase, if high-
quality data availability improves by 10%, enterprise revenue will improve by more than 10%.

[Figure callouts: problem locating time decreased by 50%; manual checks reduced thanks to self-service problem detection; availability improved by 10%; service revenue improved by more than 20%; no manual participation required thanks to proactive problem detection; time spent identifying problems reduced by 90%.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 36
Challenges Faced by Enterprises (4)

Challenge 4: Data-related management technology


and architecture.

• Traditional databases cannot process hundreds of TB-scale data or above.


• Data diversities are not considered in traditional databases. In particular, the compatibility of structured data, semi-
structured data, and non-structured data is not considered.
• Traditional databases do not have high requirements on the data processing time. However, big data needs to be
processed in real time.
• O&M of massive data needs to ensure data stability, support high concurrency, and reduce the server load.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 37
Challenges Faced by Enterprises (5)

Challenge 5: Data security.

• Life lived online makes it easy for criminals to obtain personal information, and gives rise to new crime methods
that are difficult to track and prevent.
• How to ensure personal information security becomes an important subject in the big data era. In addition,
with the continuous increase of big data, requirements on the security of physical devices for storing data as
well as on the multi-copy and disaster recovery mechanism of data will become higher and higher.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 38
Challenges Faced by Enterprises (6)

Challenge 6: Insufficient big data talents.

• Each step of big data construction must be completed by professionals. Therefore, it is necessary to develop
and build a professional team that understands big data, knows administration, and has experience in big data
applications. Hundreds of thousands of big data-related jobs are added globally every year, and a talent gap of
more than 1 million will appear in the future. Therefore, universities and enterprises must make joint efforts to
develop and mine talent.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 39
Challenges Faced by Enterprises (7)

Challenge 7: Tradeoff between data openness and


privacy.
• Today, with the growing importance of big data applications, opening and sharing data resources has
become a key factor in maintaining advantage in the data wars. However, opening data will inevitably
infringe on the privacy of some users. How to effectively protect the privacy of citizens and enterprises,
and gradually strengthen privacy legislation while promoting all-round data opening, application, and
sharing, will be a major challenge in the big data era.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 40
From Batch Processing to Real-Time Analysis

• Hadoop is a basis for batch processing of big data, but Hadoop alone cannot provide real-time analysis.

Apache Hadoop ecosystem (bottom-up):
• HDFS: Hadoop Distributed File System.
• YARN / MapReduce v2: distributed processing framework.
• ZooKeeper: coordination.
• Flume: log collector.
• Sqoop: data exchange.
• HBase: columnar store.
• Hive: SQL query.
• Pig: scripting.
• Mahout: machine learning.
• R connectors: statistics.
• Oozie: workflow.
• Ambari: provisioning, managing, and monitoring Hadoop clusters.

• Real-time intelligentization of highly integrated, high-value information and knowledge is a main
commercial requirement.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 41
Hadoop Reference Practice in the Industry

Intel® Distribution for Apache Hadoop software, with Intel® Manager for Apache Hadoop software
handling deployment, configuration, monitoring, alerts, and security. Components include:
• HDFS 2.0.3 (Hadoop Distributed File System) and YARN (MRv2) as the distributed processing framework.
• ZooKeeper 3.4.5 (coordination) and Flume 1.3.0 (log collector).
• Sqoop 1.4.1 (data exchange), HBase 0.94.1 (columnar store), Hive 0.9.0 (SQL query), Pig 0.9.2 (scripting), Mahout 0.7 (machine learning), Oozie 3.3.0 (workflow), and R® connectors (statistics).

The distribution mixes Intel proprietary parts, Intel enhancements contributed back to open source, and open
source components included without change. All external names and brands are claimed as the property of others.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 42
In-Memory Computing Reference Practice in the
Industry

Google PowerDrill:

• Based on column-oriented storage, PowerDrill uses in-memory computing to deliver query performance of trillions of data cells per second, 10 to 100 times that of traditional column-oriented storage.
• PowerDrill can quickly skip unnecessary data blocks. Compared with full scanning, performance is improved by 100 times.
• Memory usage can be optimized and reduced to 1/16 using compression and encoding technologies.

[Architecture: client → query execution tree → root server → intermediate servers → leaf servers (with local storage) → storage layer (e.g., GFS).]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 43
Stream Computing Reference Practice in the
Industry

• IBM InfoSphere Streams is one of the core components of IBM's big data strategy. It supports high-speed processing of structured and unstructured data, processing of data in motion, throughput of millions of events per second, high expansibility, and the Streams Processing Language (SPL).

• HStreaming conducted a streaming reconstruction of the Hadoop MapReduce framework. The reconstructed framework is compatible with existing mainstream Hadoop infrastructures and processes data in streaming MapReduce mode with little or no change to the framework. Gartner rated HStreaming as the coolest ESP vendor. The reconstructed framework now supports text and video processing using the Apache Pig language (Pig Latin) and provides the high scalability of Hadoop, throughput of millions of events per second, and millisecond-level delay.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 44
Opportunities in the Big Data Era

Opportunity: The big data blue ocean strategy becomes a new
focus of enterprise competition.

• The huge commercial value brought by big data will lead a great transformation equal in force to the
computer revolution of the twentieth century. Big data is affecting every field, including commerce and
economics; it is promoting the generation of a new blue ocean, creating new points of economic growth,
and becoming a new focus of enterprise competition.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 45
Talents Required During the Development of Big Data

• Big data system R&D engineers.

• Big data application development engineers.

• Big data analysts.

• Data visualization engineers.

• Data security R&D engineers.

• Data science research talents.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 46
CONTENTS
01 Big Data Era
02 Big Data Application Scope
03 Opportunities and Challenges in the Big Data Era
04 Huawei Big Data Solution

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 47
Huawei Big Data Platform Architecture

[Architecture, top-down: an application service layer accessed through Open API / SDK and REST / SNMP / Syslog; the DataFarm layer (Data → Information → Knowledge → Wisdom) with Porter, Miner, and Farmer; the Hadoop layer (Hadoop API, Plugin API) with Hive, MapReduce, Spark, Storm, and Flink running on YARN / ZooKeeper over HDFS / HBase, alongside LibrA; and Manager, providing system management, service governance, and security management.]

• The Hadoop layer provides a real-time data processing environment, enhanced on the basis of community open
source software.
• The DataFarm layer provides end-to-end data insight and builds the data supply chain from data to information,
knowledge, and wisdom, including Porter for data integration services, Miner for data mining services, and Farmer for
data service frameworks.
• Manager is a distributed management architecture. The administrator can control distributed clusters from a single
access point, covering system management, data security management, and data governance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 48
Core Capabilities of Huawei Big Data Team

Capability ladder, from basic to advanced:
• Be able to use the Apache open-source Hadoop community ecosystem.
• Be able to locate peripheral problems.
• Be able to resolve kernel-level problems (outstanding individuals).
• Be able to resolve kernel-level problems as a team.
• Be able to independently complete kernel-level development for critical service features.
• Be able to develop future-oriented kernel features.
• Be able to take the lead in the communities.
• Be able to establish top-level projects that are adaptable to the ecosystem in the communities.

Challenges addressed: a large number of components and code, frequent component updates, and the need for efficient feature integration.

• Outstanding product development and delivery capabilities and carrier-class operation support capabilities, empowered by the
Hadoop kernel team.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 49
Big Data Platform Partners from Finance and
Carrier Sectors

• Finance: Industrial and Commercial Bank of China (ICBC), China Merchants Bank (CMB), and Pacific Insurance Co., Ltd. (CPIC), covering 50% of the Top 10 customers in China's financial industry.
• Carrier: China Mobile and China Unicom, among the Top 3 telcos in China.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 50
Summary
These slides introduce:
• The big data era.
• Applications of big data in all walks of life.
• Opportunities and challenges brought by big data.
• Huawei big data solution.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 51
Quiz

1. Where is big data from? What are the features of big data?

2. Which social fields can big data be applied to?


3. What is Huawei big data solution called?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 52
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 53
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles of HDFS

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. HDFS application scenarios
B. HDFS system architecture
C. Key HDFS features
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 56
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 57
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 58
Dictionary vs. File System

Dictionary ↔ File System
• Character index ↔ File name + metadata
• Dictionary body ↔ Data blocks

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 59
HDFS Overview

The Hadoop distributed file system (HDFS) is developed based on the Google File System
(GFS) and runs on commodity hardware.
In addition to the features provided by other distributed file systems, HDFS also
provides the following features:
• High fault tolerance: resolves hardware unreliability problems.
• High throughput: supports applications involving large amounts of data.
• Large file storage: supports TB- and PB-level data storage.

HDFS is applicable to:
• Storing large files.
• Streaming data access.

HDFS is inapplicable to:
• Storing massive small files.
• Random writes.
• Low-latency reads.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 60
HDFS Application Scenarios

HDFS is the distributed file system of the Hadoop technical framework and is
used to manage files on multiple independent physical servers.

It is applicable to the following scenarios:

• Website user behavior data storage.


• Ecosystem data storage.
• Meteorological data storage.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 62
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 63
Position of HDFS in FusionInsight

[Architecture: the FusionInsight stack: an application service layer with Open API / SDK and REST / SNMP / Syslog; DataFarm (Data → Information → Knowledge → Wisdom) with Porter, Miner, and Farmer; Manager (system management, service governance, security management); and the Hadoop layer with Hive, M/R, Spark, Storm, and Flink running on YARN / ZooKeeper over HDFS / HBase, alongside LibrA.]

As the Hadoop storage infrastructure, HDFS serves as a distributed, fault-
tolerant file system with linear scalability.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 64
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 65
Basic System Architecture

[HDFS architecture: the NameNode holds metadata (name, replicas, ..., e.g., /home/foo/data, 3, ...). Clients issue metadata operations to the NameNode and read from or write to DataNodes directly; the NameNode issues block operations to DataNodes, and blocks are replicated across DataNodes in different racks (Rack 1, Rack 2).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 66
HDFS Data Write Process

1. create: The HDFS client calls create() on DistributedFileSystem.
2. create: DistributedFileSystem asks the NameNode to create the file in the namespace.
3. write: The client writes data to the FSDataOutputStream.
4. write packet: Data packets are written along a pipeline of DataNodes, each DataNode forwarding to the next.
5. ack packet: Acknowledgments flow back along the pipeline.
6. close: The client closes the FSDataOutputStream.
7. complete: DistributedFileSystem notifies the NameNode that the write is complete.
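The same sequence can be driven from the Java API. Below is a minimal sketch, assuming a reachable HDFS cluster and the Hadoop client libraries on the classpath; the path /tmp/hello.txt and its content are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // a DistributedFileSystem when fs.defaultFS points to HDFS
        // Steps 1-2: create() registers the new file with the NameNode and
        // returns a stream backed by the DataNode write pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
            out.writeUTF("Hello, HDFS");            // steps 3-5: packets flow down the pipeline, acks flow back
        } // steps 6-7: close() flushes remaining packets and signals completion to the NameNode
        fs.close();
    }
}
```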

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 67
HDFS Data Read Process

1. open: The HDFS client calls open() on DistributedFileSystem.
2. get block location: DistributedFileSystem obtains the file's block locations from the NameNode.
3. read: The client reads from the FSDataInputStream.
4-5. read: The stream reads each block from the nearest DataNode that holds it, advancing from block to block transparently.
6. close: The client closes the FSDataInputStream.
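The matching read side, again as a minimal sketch under the same assumptions (cluster reachable, client libraries present, illustrative path):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Steps 1-2: open() asks the NameNode for the file's block locations.
        try (FSDataInputStream in = fs.open(new Path("/tmp/hello.txt"))) {
            // Steps 3-5: reads are served by the closest DataNode holding each
            // block; the stream advances block to block transparently.
            System.out.println(in.readUTF());
        } // step 6: close() releases the DataNode connections
        fs.close();
    }
}
```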

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 68
CONTENTS
01 HDFS Overview and Application Scenarios
02 Position of HDFS in FusionInsight HD
03 HDFS System Architecture
04 Key Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 69
Key Design of HDFS Architecture

Key design points of the HDFS architecture:
• NameNode / DataNode in master/slave mode
• Federation storage
• Unified file system namespace
• Data storage policies
• Data replication
• High availability (HA)
• Metadata persistence
• Multiple access modes
• Space reclamation
• Robustness

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 70
HDFS High Availability (HA)

[HA architecture: an active NameNode and a standby NameNode, each paired with a ZKFC that exchanges heartbeats with a ZooKeeper cluster for failover arbitration. The active NameNode writes the EditLog to a group of JournalNodes (JN); the standby reads the log from them and keeps its FSImage synchronized. The HDFS client sends metadata operations to the active NameNode; DataNodes heartbeat to both NameNodes, serve data reads and writes, and copy blocks among themselves.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 71
Metadata Persistence

Checkpointing between the active and standby NameNodes:

1. The standby NameNode notifies the active NameNode to roll its EditLog; the active node starts writing new edits to Editlog.new.
2. The standby obtains the EditLog and FSImage from the active node. (The FSImage is downloaded when the NameNode is initialized; the local FSImage file is used afterwards.)
3. The standby merges the EditLog and FSImage into a new FSImage.ckpt.
4. The standby uploads the new FSImage.ckpt to the active node.
5. The active node rolls the files: FSImage.ckpt becomes the new FSImage, and Editlog.new becomes the new EditLog.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 72
HDFS Federation

[Federation architecture: multiple NameNodes (NN1 ... NN-k ... NN-n), each managing its own namespace (NS1 ... NS-k ... NS-n) and an associated block pool (Pool 1 ... Pool k ... Pool n). Applications reach the namespaces through clients (Client-1 ... Client-k ... Client-n). All block pools share common block storage on the same set of DataNodes (DataNode1 ... DataNodeN).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 73
Data Replication

[Placement policy within a data center: the first replica of a block is placed on the node closest to the writing client (distance 0), another replica on a node in the same rack (distance 2), and a further replica on a node in a remote rack (distance 4). The diagram shows blocks B1-B4 placed across nodes Node1-Node5 in RACK1, RACK2, and RACK3.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 74
Configuring HDFS Data Storage
Policies

By default, the HDFS NameNode automatically selects DataNodes to store data
replicas. In practice, the following scenarios also exist:

• Layered storage: select a proper storage device, among multiple devices on a DataNode, for layered data storage.
• Tag storage: select a proper DataNode according to directory tags, which indicate data importance levels.
• Node group storage: store key data in highly reliable node groups, because the DataNode cluster uses heterogeneous servers.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 75
Configuring HDFS Data Storage Policies - Layered
Storage
Configuring DataNodes with layered storage:
• The HDFS layered storage architecture provides four types of storage devices: RAM_DISK (memory
virtualization hard disk), DISK (mechanical hard disk), ARCHIVE (high-density, low-cost storage
media), and SSD (solid state disk).
• Storage policies for different scenarios are formulated by combining the four types of storage devices:

Policy ID | Policy Name | Block Location (number of replicas) | Fallback storage (creation) | Fallback storage (replica)
15 | LAZY_PERSIST | RAM_DISK: 1, DISK: n-1 | DISK | DISK
12 | ALL_SSD | SSD: n | DISK | DISK
10 | ONE_SSD | SSD: 1, DISK: n-1 | SSD, DISK | SSD, DISK
7 | HOT (default) | DISK: n | <none> | ARCHIVE
5 | WARM | DISK: 1, ARCHIVE: n-1 | ARCHIVE, DISK | ARCHIVE, DISK
2 | COLD | ARCHIVE: n | <none> | <none>
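Policies can be applied per directory. Below is a sketch using the public FileSystem API (setStoragePolicy is available in Hadoop 2.6 and later, getStoragePolicy in 2.7 and later); the directory paths are illustrative, and the COLD policy only takes effect if DataNodes actually expose ARCHIVE-tagged volumes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoragePolicySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hotDir = new Path("/data/serving");    // illustrative paths
        Path coldDir = new Path("/data/archive");
        fs.setStoragePolicy(hotDir, "HOT");         // DISK: n (the default policy)
        fs.setStoragePolicy(coldDir, "COLD");       // ARCHIVE: n
        System.out.println(fs.getStoragePolicy(coldDir)); // verify the assignment
        fs.close();
    }
}
```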

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 76
Configuring HDFS Data Storage Policies - Tag Storage

[Tag storage: the NameNode maps directory tags to DataNode tags, e.g., /HBase tagged T1; /Hive tagged T1 and T3; /Spark tagged T2; /Flume tagged T3, and places replicas only on DataNodes carrying matching tags (DataNodes A-F carry combinations of T1, T2, and T3).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 77
Configuring HDFS Data Storage Policies - Node Group
Storage

[Node group storage: DataNodes are organized into rack groups (Rackgroup1 through Rackgroup4, each with two nodes; Rackgroup2 is mandatory). Files are placed by replica count, e.g., File 1 (1 replica), File 2 (3 replicas), File 3 (2 replicas), across the mandatory group and the others.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 78
Colocation

The definition of colocation: storing associated data, or data that is going to be associated, on the
same storage node.
As illustrated below, assume that file A and file D are going to be associated with each other. Without
colocation, their blocks are scattered across DataNodes (DN1-DN6), so joining them involves massive data
migration. The data transmission consumes much bandwidth, which greatly slows the processing of massive
data and degrades system performance.

[Diagram: the NameNode (NN) tracks files A, B, C, and D whose blocks are spread over DN1-DN6.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 79
Colocation Benefits

HDFS colocation stores files that need to be associated with each other on the same DataNode, so
that data does not have to be fetched from other nodes during associated computing. This greatly reduces
network bandwidth consumption.
When joining files A and D with the colocation feature, resource consumption is greatly reduced because the
blocks of the associated files are distributed on the same storage nodes.

[Diagram: with colocation, the blocks of files A and D are placed together on the same DataNodes.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 80
HDFS Data Integrity Assurance

HDFS ensures the integrity of the stored data and implements reliability
processing for the failure of each component:

Rebuilds data replicas on failed data disks.
• Each DataNode periodically reports block messages to the NameNode; if a replica (block) fails,
the NameNode starts a procedure to recover the lost replicas.

Ensures data balance among DataNodes.
• The HDFS architecture provides a data balancing mechanism, which ensures the even
distribution of data among all DataNodes.

Ensures metadata reliability.
• A log mechanism records metadata operations, and metadata is stored on both the active and standby NameNodes.
• The snapshot mechanism of the file system ensures that data can be recovered in a timely manner when a
misoperation occurs.

Provides the safe mode.
• HDFS provides a safe mode to prevent faults from spreading when a DataNode or hard disk is faulty.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 81
Other Key Design Points of the HDFS Architecture

Unified file system:


HDFS presents itself as one unified file system externally.

Space reclamation:

The recycle bin mechanism is provided and the number of replicas can be dynamically set.

Data organization:

Data is stored by block in the HDFS.

Access mode:
Data can be accessed through Java APIs, HTTP, or shell commands.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 82
Common Shell Commands

Type: dfs
• -cat: Show file contents
• -ls: Show a directory listing
• -rm: Delete files
• -put: Upload directories/files to HDFS
• -get: Download directories/files from HDFS
• -mkdir: Create a directory
• -chmod / -chown: Change the permissions / owner of files
• ...

Type: dfsadmin
• -safemode: Safe mode operations
• -report: Report service status

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 83
Summary
This module describes the following information about HDFS: basic concepts,
application scenarios, technical architecture and its key features.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 84
Quiz

• What is HDFS and what can it be used for?


• What are the design objectives of HDFS?
• Describe the HDFS read and write processes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 85
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 86
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles of MapReduce and YARN

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. Concepts of MapReduce and YARN
B. Application scenarios and principles of MapReduce
C. Functions and architectures of MapReduce and YARN
D. New features of YARN
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 89
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 90
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 91
MapReduce Overview

MapReduce is developed based on the MapReduce paper published by Google and is used for parallel
computing on massive data sets (larger than 1 TB). It delivers the following highlights:

• Easy to program: programmers only need to describe what to do, and the execution framework does the job accordingly.
• Outstanding scalability: cluster capability can be improved by adding nodes.
• High fault tolerance: cluster availability and fault tolerance are improved by policies such as computing or data migration.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 92
YARN Overview

Apache Hadoop YARN (Yet Another Resource Negotiator) is the new Hadoop resource manager. It
provides unified resource management and scheduling for upper-layer applications, remarkably
improving cluster resource utilization, unified resource management, and data sharing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 93
Position of YARN in FusionInsight

[Architecture: the FusionInsight stack: an application service layer with Open API / SDK and REST / SNMP / Syslog; DataFarm (Data → Information → Knowledge → Wisdom) with Porter, Miner, and Farmer; Manager (system management, service governance, security management); and the Hadoop layer with Hive, M/R, Spark, Streaming, and Flink running on YARN / ZooKeeper over HDFS / HBase, alongside LibrA.]

YARN is the resource management system of Hadoop 2.0. It is a general
resource management module that manages and schedules resources for
applications.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 94
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 95
Working Process of MapReduce (1)

• Before starting MapReduce, make sure that the files to be processed are stored in HDFS.
• Commit: the client submits a request to ResourceManager (uploading Job.jar, Job.split, and Job.xml), and ResourceManager creates a job. One application maps to one job (example job ID: job_201431281420_0001).
• Split: before jobs are submitted, the files to be processed are split. By default, the MapReduce framework regards one block as one split. Client applications can redefine the mapping between blocks and splits.
• After the job is submitted, ResourceManager selects an appropriate NodeManager in the cluster, based on NodeManager workloads, to start the ApplicationMaster. The ApplicationMaster initializes the job and applies to ResourceManager for resources. ResourceManager then selects appropriate NodeManagers to start containers for task execution.
• Map: the outputs of Map tasks are placed in a buffer in memory. When the buffer overflows, its data is written to local disks; before that, the following steps must be completed:
  1. Partition: by default, the hash algorithm is used for partitioning. The MapReduce framework determines the number of partitions from the number of Reduce tasks. Records with the same key are sent to the same Reduce task for processing.
  2. Sort: the outputs of Map are sorted; for example, ('Hi','1'),('Hello','1') are reordered as ('Hello','1'),('Hi','1').

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 96
Working Process of MapReduce (2)

3. Combine: optional by default. For example, ('Hi','1'),('Hi','1'),('Hello','1'),('Hello','1') are combined into ('Hi','2'),('Hello','2').
4. Spill/Merge: after a Map task finishes, many spill files may have been generated. These spill files are merged into a single MOF (MapOutFile) that is partitioned and sorted. To reduce the amount of data written to disks, MapReduce allows MOFs to be compressed before being written.

• Copy: when the MOF output progress of Map tasks reaches 3%, Reduce tasks start and fetch MOF files from each Map task. The number of Reduce tasks is determined by the client, and the number of MOF partitions is determined by the number of Reduce tasks; for this reason, the MOF files output by Map tasks map to Reduce tasks.
• Sort/Merge: MOF files need to be sorted. If the amount of data received by a Reduce task is small, it is kept directly in the buffer; as the number of files in the buffer increases, a MapReduce background thread merges them into a large, sorted file (in memory or on disk). Many intermediate files are generated during merging; the last merge result is fed directly to the user-defined Reduce function.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 98
Shuffle Mechanism

Combine → Spill/Merge → Copy → Sort/Merge → Reduce

Shuffle is the data transfer process between the Map phase and the Reduce phase. It involves
Reduce tasks obtaining MOF files from Map tasks, then sorting and merging the MOF files.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 99
Example: Typical Program WordCount

[Diagram: (1) a WordCount application is submitted to ResourceManager; (2) an ApplicationMaster is started on a NameNode/NodeManager; (3) Map tasks in containers on slave nodes (NodeManager + DataNode) emit key-value pairs such as <a,1>, <are,1>, <hi,1>, <hello,1>; (4) Reduce tasks aggregate them into totals such as <a,3>, <are,2>, <hi,2>, <hello,3>.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 100
Functions of WordCount

Input: a file that contains words. Output: the number of times each word occurs.

Input:
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop

Output (via MapReduce):
Bye 3
Hadoop 4
Hello 3
World 2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 101
Map Process of WordCount

Input → Map → Output:

01 "Hello World Bye World" → <Hello,1> <World,1> <Bye,1> <World,1>
02 "Hello Hadoop Bye Hadoop" → <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
03 "Bye Hadoop Hello Hadoop" → <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 102
Reduce Process of WordCount

Map output → Combine → Shuffle → Reduce input → Reduce output:

• Combine merges each Map task's output locally: <Hello,1> <World,1> <Bye,1> <World,1> becomes <Hello,1> <World,2> <Bye,1>; <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> becomes <Hello,1> <Hadoop,2> <Bye,1>; <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> becomes <Bye,1> <Hadoop,2> <Hello,1>.
• Shuffle groups the values by key, and each Reduce task aggregates its keys:
  <Bye,1 1 1> → Reduce → Bye 3
  <Hadoop,2 2> → Reduce → Hadoop 4
  <Hello,1 1 1> → Reduce → Hello 3
  <World,2> → Reduce → World 2
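The program behind these slides is essentially the canonical Hadoop MapReduce example. Below is a minimal sketch against the org.apache.hadoop.mapreduce API (class name and I/O paths are illustrative); note how setCombinerClass reuses the reducer to implement the optional Combine step described above.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                ctx.write(word, ONE);            // emits <word, 1> for each token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();  // sums <word, [1,1,...]>
            result.set(sum);
            ctx.write(key, result);              // emits <word, total>
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // the optional Combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, this would be launched with something like: yarn jar wordcount.jar WordCount /input /output.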

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 103
Architecture of YARN

[Diagram: clients submit jobs to ResourceManager; each application gets an ApplicationMaster (App Mstr) running in a container on a NodeManager; the ApplicationMaster sends resource requests to ResourceManager and launches containers on NodeManagers. Arrows denote job submission, MapReduce status, node status, and resource requests.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 104
Task Scheduling of MapReduce on YARN

[Diagram, steps 1-8: (1) the client submits an application to ResourceManager (Applications Manager + Scheduler); (2) ResourceManager allocates a container on a NodeManager and (3) launches the MR ApplicationMaster in it; (4) the ApplicationMaster registers with ResourceManager; (5) it requests resources; (6) NodeManagers launch containers for Map and Reduce tasks; (7) the tasks report status to the ApplicationMaster; (8) on completion, the ApplicationMaster unregisters from ResourceManager.]
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 105
YARN HA Solution

ResourceManager in YARN manages resources and schedules tasks for the cluster. The YARN HA solution
uses redundant ResourceManager nodes to solve the single point of failure of ResourceManager:

1. The active ResourceManager writes its state into a ZooKeeper cluster.
2. If the active ResourceManager fails, failover to the standby ResourceManager occurs automatically.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 106
YARN ApplicationMaster Fault Tolerance Mechanism

[Diagram: when an ApplicationMaster (AM-1) fails or restarts, YARN launches a new AM-1 instance, which recovers the application and re-attaches to its containers.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 107
CONTENTS
01 Introduction to MapReduce and YARN
02 Functions and Architectures of MapReduce and YARN
03 Resource Management and Task Scheduling of YARN
04 Enhanced Features

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 108
Resource Management

• Yarn manages and allocates memory and CPU resources.
• The memory and CPU resources offered by each NodeManager can be
configured (on the Yarn service configuration page), for example:
  - yarn.nodemanager.resource.memory-mb
  - yarn.nodemanager.vmem-pmem-ratio
  - yarn.nodemanager.resource.cpu-vcores
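In stock Hadoop these keys live in yarn-site.xml and are read by each NodeManager at startup; the sketch below only illustrates the three parameters programmatically, and the values are assumptions rather than recommendations.

```java
import org.apache.hadoop.conf.Configuration;

public class NodeManagerResourceSketch {
    public static void main(String[] args) {
        // Illustrative only: in a real cluster these keys are set in
        // yarn-site.xml (or on the Yarn service configuration page).
        Configuration conf = new Configuration(false);
        conf.setInt("yarn.nodemanager.resource.memory-mb", 65536); // memory offered to containers
        conf.setFloat("yarn.nodemanager.vmem-pmem-ratio", 2.1f);   // virtual-to-physical memory cap
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 16);   // vcores offered to containers
        System.out.println(conf.get("yarn.nodemanager.resource.memory-mb") + " MB per NodeManager");
    }
}
```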

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 109
Resource Allocation Model

The scheduler works over a queue hierarchy (Root → parent queues → leaf queues):
1. Selects a queue.
2. Selects an application (App1 ... App N) from the queue.
3. Matches the application's resource requests, preferring, in order: a specific server (Server A, Server B), a specific rack (Rack A, Rack B), and finally any resources.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 110
Capacity Scheduler Overview

Capacity Scheduler enables Hadoop applications to run in a shared, multi-tenant cluster while maximizing the throughput and utilization of the cluster.

Capacity Scheduler allocates resources by queue. Users can set upper and lower limits for the resource usage of each queue. Administrators can restrict the resources used by a queue, user, or job. Job priorities can be set, but resource preemption is not supported.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 111
Highlights of Capacity Scheduler

• Capacity assurance: Administrators can set upper and lower limits for the resource usage of each
queue. All applications submitted to the queue share the resources.

• Flexibility: The remaining resources of a queue can be used by other queues that require resources. If a new
application is submitted to the queue, other queues release and return the resources to the queue.

• Priority: Priority queuing is supported (FIFO by default).


• Multi-leasing: Multiple users can share a cluster, and multiple applications can run concurrently.
Administrators can add multiple restrictions to prevent cluster resources from being exclusively occupied by an
application, user, or queue.

• Dynamic update of configuration files: Administrators can dynamically modify configuration parameters to manage clusters online.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 112
Task Selection by Capacity Scheduler

During scheduling, select an appropriate queue first based on the following


policies:
• The queue with the lower resource usage is allocated first. For example, queues Q1 and Q2 both have a capacity of 30; if the used capacity of Q1 is 10 and that of Q2 is 12, resources are allocated to Q1 first.
• Resources are allocated to the queue at the shallower level of the queue hierarchy first. For example, between QueueA and QueueB.childQueueB, resources are allocated to QueueA first.
• Resources are allocated to the resource reclamation request queue first.

A task is then selected from the queue based on the following policy:
• The task is selected based on the task priority and submission sequence as well as the limits of user resources and memory.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 113
Queue Resource Limitation (1)

Queues are created on the Tenant page. After a tenant is created and associated with YARN, a queue with the same name as the tenant is created. For example, if tenants QueueA and QueueB are created, two YARN queues QueueA and QueueB are created.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 114
Queue Resource Limitation (2)
Queue resource capacity (percentage): there are three queues, default, QueueA, and QueueB, and each has a [queue name].capacity configuration:
• The capacity of the default queue is 20% of the total cluster resources.
• The capacity of the QueueA queue is 10% of the total cluster resources.
• The capacity of the QueueB queue is 10% of the total cluster resources.
• The root-default shadow queue in the background holds the remaining 60% of the total cluster resources.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 115
Queue Resource Limitation (3)

01 Sharing idle resources: due to resource sharing, the resources used by a queue may exceed its capacity (for example, QueueA.capacity). The maximum resource usage can be limited by the maximum-capacity parameter.

02 If only a few tasks are running in a queue, the remaining resources of the queue can be shared with other queues. For example, if maximum-capacity of QueueA is set to 100 and tasks are running in QueueA only, QueueA can theoretically use all the cluster resources.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 116
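A worked illustration with assumed values: if QueueA.capacity = 10 and QueueA.maximum-capacity = 30, QueueA is guaranteed 10% of the cluster, may borrow idle resources up to 30%, and returns the borrowed share when other queues need their guaranteed capacity again.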
User and Task Limitation

Log in to FusionInsight Manager and choose Tenant > Dynamic Resource Plan > Queue Config to configure user and task limitation parameters.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 117
User Limitation (1)
Minimum resource assurance (percentage) of a user:
• The resources for each user in a queue are limited at any time. If tasks of multiple users are running at the same time in a queue, the resource usage of each user fluctuates
between the minimum value and the maximum value. The maximum value is determined by the number of running tasks, while the minimum value is determined by minimum-
user-limit-percent.

For example, if yarn.scheduler.capacity.root.QueueA.minimum-user-limit-percent=25, the queue


resources are adjusted as follows when the number of users who submit tasks increases:

• The first user submits tasks to QueueA: the user obtains 100% of QueueA resources.
• The second user submits tasks to QueueA: each user obtains 50% of QueueA resources at most.
• The third user submits tasks to QueueA: each user obtains 33.33% of QueueA resources at most.
• The fourth user submits tasks to QueueA: each user obtains 25% of QueueA resources at most.
• The fifth user submits tasks to QueueA: to ensure that each user obtains at least 25% of the resources, the fifth user cannot obtain any resources and must wait for them to be released by other users.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 118
User Limitation (2)

Maximum resource usage of a user (multiples of queue capacity):

Indicates the multiples of queue capacity. This parameter sets the resources that can be obtained by one user, with a default of 1: yarn.scheduler.capacity.root.QueueD.user-limit-factor=1 indicates that the resource capacity obtained by a user cannot exceed the queue capacity. No matter how many free resources a cluster has, the resource capacity that can be obtained by a user cannot exceed maximum-capacity.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 119
Task Limitation

01 Maximum number of active tasks: the maximum number of active tasks in a cluster, including running and suspended tasks. When the number of submitted task requests reaches the limit, new tasks are rejected. The default value is 10000.

02 Maximum number of tasks in a queue: the maximum number of tasks submitted to a queue. If the parameter value is set to 1000 for QueueA, QueueA allows a maximum of 1000 active tasks.

03 Maximum number of tasks submitted by a user: depends on the maximum number of tasks in a queue. If QueueA allows a maximum of 1000 tasks, the maximum number of tasks that each user can submit is 1000 x yarn.scheduler.capacity.root.QueueA.minimum-user-limit-percent (assume 25%) x yarn.scheduler.capacity.root.QueueA.user-limit-factor (assume 1).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 120
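Worked example with the assumed values above: each user of QueueA may submit at most 1000 x 25% x 1 = 250 tasks.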
Queue Information
Choose Services > YARN > ResourceManager (active) > Scheduler to view queue information.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 121
CONTENTS
01 02 03 04
Introduction to Functions and Resource Enhanced Features
MapReduce and Architectures of Management and
YARN MapReduce and Task Scheduling of
YARN YARN

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 122
Enhanced Features - YARN Dynamic Memory Management

(Figure: flowchart. NodeManager calculates the memory usage of each Container. If a Container does not exceed its own memory threshold, the Containers can run. If it does, NodeManager checks whether the total memory usage exceeds the threshold set for NodeManager: if not, the Containers can still run; if yes, Containers with excessive memory usage cannot run.)

NM MEM Threshold = yarn.nodemanager.resource.memory-mb x 1024 x 1024 x yarn.nodemanager.dynamic.memory.usage.threshold

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 123
Enhanced Features - YARN Label-based Scheduling

(Figure: tasks are submitted to queues, and queues map to labeled NodeManagers. Applications with common resource requirements run on servers with standard performance; applications with demanding memory requirements run on servers with large memory; applications with demanding I/O requirements run on servers with high I/O.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 124
Summary
This module describes the following information: application scenarios and architectures of MapReduce and YARN, resource management and task scheduling of YARN, and enhanced features of YARN in FusionInsight HD.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 125
Quiz

• What is the working principle of MapReduce?


• What is the working principle of Yarn?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 126
Quiz

• What are highlights of MapReduce?


A. Easy to program.
B. Outstanding scalability.
C. Real-time computing.
D. High fault tolerance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 127
Quiz

• What is the abstraction of Yarn resources?


A. Memory.
B. CPU.
C. Container.
D. Disk space.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 128
Quiz

• What does MapReduce apply to?


A. Iterative computing.
B. Offline computing.
C. Real-time interactive computing.
D. Stream computing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 129
Quiz

• What are highlights of capacity scheduler?


A. Capacity assurance.
B. Flexibility.
C. Multi-leasing.
D. Dynamic update of configuration files.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 130
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 131
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles of
FusionInsight Spark2x

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives

Upon completion of this course, you will be able to:
A. Understand application scenarios and master highlights of Spark
B. Master the computing capability and technical framework of Spark
C. Master the integration of Spark components in FusionInsight HD
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 134
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 135
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 136
Spark Introduction

• Apache Spark was developed in the UC Berkeley AMP lab in 2009.

• Apache Spark is a fast, versatile, and scalable in-memory big data computing engine.

• Apache Spark is a one-stop solution that integrates batch processing, real-time stream processing,
interactive query, graph computing, and machine learning.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 137
Application Scenarios

• Batch processing can be used for extracting,


transforming, and loading (ETL).

• Machine learning can be used to automatically


determine whether comments of Taobao
customers are positive or negative.
• Interactive analysis can be used to query the
Hive data warehouse.
• Stream processing can be used for real-time
businesses such as page-click stream analysis,
recommendation systems, and public opinion analysis.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 138
Spark Highlights

Light Fast
• Spark core code has • Delay for small
30,000 lines. 01 02 datasets reaches the
sub-second level.

Spark
Flexible Smart
• Spark offers different 04 03 • Spark smartly uses
levels of flexibility. existing big data
components.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 139
Spark Ecosystem

(Figure: the Spark ecosystem. Spark connects to applications, environments, and data sources, including Hive, HBase, Hadoop, Mahout, Docker, Mesos, Elasticsearch, Flume, Kafka, and MySQL.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 140
Spark vs MapReduce (1)

(Figure: data sharing in MapReduce vs Spark. In MapReduce, each iteration and each query reads from and writes to HDFS, which suits one-time processing only; in Spark, iterations and queries share data through distributed memory, avoiding repeated HDFS reads and writes.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 141
Spark vs MapReduce (2)

                      Hadoop        Spark         Spark
Data volume           102.5 TB      102 TB        1000 TB
Time required (min)   72            23            234
Number of nodes       2100          206           206
Number of cores       50,400        6592          6592
Rate                  1.4 TB/min    4.27 TB/min   4.27 TB/min
Rate/node             0.67 GB/min   20.7 GB/min   22.5 GB/min
Daytona Gray Sort     Yes           Yes           Yes

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 142
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 143
Spark System Architecture

Structured Spark
Spark SQL MLlib GraphX SparkR
Streaming Streaming

Spark Core

Standalone YARN Mesos

Existing functions of Spark 1.0

New functions of Spark 2.0

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 144
Core Concepts of Spark - RDD
• Resilient Distributed Datasets (RDDs) are elastic, read-only, and partitioned distributed datasets.
• RDDs are stored in memory by default and are written to disks when the memory is insufficient.
• RDD data is stored in the cluster as partitions.
• RDD has a lineage mechanism (Spark Lineage), which allows for rapid data recovery when data loss occurs.

(Figure: RDD data flow. Data in HDFS, such as "Hello Spark", "Hello Hadoop", and "China Mobile", is loaded into the Spark cluster as partitioned RDD1, transformed into RDD2, and can be written to external storage.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 146
RDD Dependencies

Narrow dependencies (each parent partition is used by at most one child partition): map, filter; union; join with inputs co-partitioned.

Wide dependencies (a parent partition feeds multiple child partitions): groupByKey; join with inputs not co-partitioned.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 147
Stage Division of RDD

(Figure: stage division of an RDD DAG. RDD A feeds B through groupBy, forming Stage1; C is mapped to D and unioned with E into F, forming Stage2; B and F are joined into G, forming Stage3. Stages are split at wide dependencies.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 148
RDD Operators

Transformation

• Transformation operators are invoked to generate a new RDD from one or more existing RDDs.
Such an operator initiates a job only when an Action operator is invoked.
• Typical operators: map, flatMap, filter, and reduceByKey.

Action

• A job is immediately started when action operators are invoked.


• Typical operators: take, count, and saveAsTextFile.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 149
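The laziness of Transformation operators can be seen in a minimal sketch using the Spark Java API (the input path is illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class LazyDemo {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("LazyDemo"));
      // Transformations only record the lineage; no job runs yet.
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input");
      JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator()); // transformation
      JavaRDD<String> longWords = words.filter(w -> w.length() > 3);                      // transformation
      // The action below triggers the actual computation.
      System.out.println(longWords.count());                                              // action
      sc.stop();
    }
  }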
Major Roles of Spark (1)

Driver

Responsible for the application business logic and operation planning (DAG).

ApplicationMaster

Manages application resources and applies for resources from ResourceManager as needed.

Client

Demand side that submits requirements (applications).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 150
Major Roles of Spark (2)

ResourceManager

The resource management department that centrally schedules and distributes resources in the entire cluster.

NodeManager

Manages the resources of the current node.

Executor

The actual executor of a task. An application is split across multiple executors for computing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 151
Spark on YARN - Client Operation Process

(Figure: Spark on YARN-Client. The Driver with its YARNClientSchedulerBackend runs on the client: (1) the client submits an application to the ResourceManager; (2) the ApplicationMaster (ExecutorLauncher) is submitted; (3) the ApplicationMaster applies for containers; (4) NodeManagers start the containers and Executors; (5) the Driver schedules tasks on the Executors.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 152
Spark on YARN - Cluster Operation Process

(Figure: Spark on YARN-Cluster. (1) The client submits an application to the ResourceManager; (2) the ResourceManager allocates resources for the application and starts the ApplicationMaster, which includes the Driver and DAGScheduler, in a container on a NodeManager; (3) the ApplicationMaster applies to the ResourceManager for Executors; (4) the Driver assigns tasks to the Executors; (5) Executors report task statuses back.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 153
Differences Between YARN-Client and YARN-Cluster

Differences

• The differences between YARN-Client and YARN-Cluster lie in the ApplicationMaster (in YARN-Cluster mode, the ApplicationMaster also hosts the Driver).
• YARN-Client is suitable for testing, whereas YARN-Cluster is suitable for production.
• If the task submission node in YARN-Client mode goes down, the entire task fails; the same failure in YARN-Cluster mode does not affect the task.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 154
Typical Case - WordCount

textFile -> flatMap -> map -> reduceByKey -> saveAsTextFile
HDFS -> RDD -> RDD -> RDD -> RDD -> HDFS

Input lines (HDFS): "An apple", "A pair of shoes", "Orange apple"
After flatMap: An, apple, A, pair, of, shoes, Orange, apple
After map: (An,1) (apple,1) (A,1) (pair,1) (of,1) (shoes,1) (Orange,1) (apple,1)
After reduceByKey: (An,1) (A,1) (apple,2) (pair,1) (of,1) (shoes,1) (Orange,1)
The result is saved back to HDFS.
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 155
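The chain on this slide maps one-to-one onto the Spark Java API. A minimal sketch (paths are illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import scala.Tuple2;

  public class SparkWordCount {
    public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));
      JavaRDD<String> lines = sc.textFile("hdfs:///tmp/words");                           // textFile
      JavaRDD<String> words = lines.flatMap(l -> Arrays.asList(l.split(" ")).iterator()); // flatMap
      JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));      // map
      JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);           // reduceByKey
      counts.saveAsTextFile("hdfs:///tmp/wordcount-out");                                 // saveAsTextFile (action)
      sc.stop();
    }
  }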
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 156
Spark SQL Overview

• Spark SQL is the module used in Spark for structured data processing. In Spark
applications, you can seamlessly use SQL statements or DataFrame APIs to query
structured data.

(Figure: the Catalyst pipeline. A SQL AST or a DataFrame/Dataset becomes an Unresolved Logical Plan; Analysis (using the Catalog) produces a Logical Plan; Optimization produces an Optimized Logical Plan; candidate Physical Plans are generated and a Cost Model selects one Physical Plan; Code Generation turns it into RDDs.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 157
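As a sketch of the two equivalent entry points, SQL statements and the DataFrame API go through the same Catalyst pipeline (the file path, view, and column names are illustrative):

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class SparkSqlDemo {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder().appName("SparkSqlDemo").getOrCreate();
      Dataset<Row> people = spark.read().json("hdfs:///tmp/people.json"); // DataFrame API
      people.createOrReplaceTempView("people");
      Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18"); // SQL API
      adults.show();
      spark.stop();
    }
  }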
Introduction to Dataset

• A dataset is a strongly typed collection of objects in a


particular domain that can be converted in parallel by a
function or relationship operation.
• A dataset is represented by a Catalyst logical execution
plan, and the data is stored in encoded binary form, and
the sort, filter, and shuffle operations can be performed
without deserialization.
• A dataset is lazy and triggers computing only when an
action operation is performed. When an action
operation is performed, Spark uses the query optimizer
to optimize the logical plan and generate an efficient
parallel distributed physical plan.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 158
Introduction to DataFrame

DataFrame is a dataset with specified column names. DataFrame is a special case of Dataset[Row].

(Figure: an RDD[Person] stores whole Person objects, while a DataFrame organizes the same records into named, typed columns: name: String, age: Int, height: Double.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 159
RDD, DataFrame, and Datasets (1)

RDD:
• Advantages: type-safe and object-oriented.
• Disadvantages: high performance overhead for serialization and deserialization; high GC overhead due to frequent object creation and deletion.

DataFrame:
• Advantages: schema information reduces serialization and deserialization overhead.
• Disadvantages: not object-oriented; no type checking at compile time.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 160
RDD, DataFrame, and Datasets (2)

Characteristics of Dataset:
• Fast: in most scenarios, performance is superior to RDD. Encoders outperform Kryo or Java serialization and avoid unnecessary format conversion.
• Type-safe: similar to RDD, functions are type-checked at compile time as far as possible.
• Dataset, DataFrame, and RDD can be converted to each other.

Dataset has the advantages of RDD and DataFrame, and avoids their disadvantages.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 161
Spark SQL and Hive

• The execution engine of Spark SQL is Spark Core, and the default
execution engine of Hive is MapReduce.
Differences • The execution speed of Spark SQL is 10 to 100 times faster than Hive.
• Spark SQL does not support buckets, but Hive does.

• Spark SQL depends on the metadata of Hive.


• Spark SQL is compatible with most syntax and functions of Hive. Dependencies
• Spark SQL can use the custom functions of Hive.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 162
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 163
Structured Streaming Overview (1)

Structured Streaming is a streaming data-processing engine built on the Spark SQL engine. You can express a streaming computation in the same way as a batch computation on static data. As streaming data keeps arriving, Spark SQL processes it incrementally and continuously, and updates the results into the result set.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 164
Structured Streaming Overview (2)

(Figure: a data stream is modeled as an unbounded table. New data in the data stream corresponds to new rows appended to the unbounded table.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 165
Programming Model for Structured Streaming

(Figure: with a trigger every 1 second, the input table grows over time (data up to t=1, t=2, t=3); the query runs against the input at every trigger, and the result up to each t is written to the output in complete mode.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 166
Example Programming Model of Structured
Streaming
(Figure: words typed into nc, "Cat dog" and "Dog dog" by t=1, "Owl cat" by t=2, and "Dog owl" by t=3, are appended to the unbounded table. The running word count evolves accordingly: t=1 gives Cat 1, Dog 3; t=2 gives Cat 2, Dog 3, Owl 1; t=3 gives Cat 2, Dog 4, Owl 2.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 167
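The running word count above is commonly expressed with the Structured Streaming API; a minimal sketch reading from a socket (host and port are illustrative):

  import java.util.Arrays;
  import org.apache.spark.api.java.function.FlatMapFunction;
  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Encoders;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;
  import org.apache.spark.sql.streaming.StreamingQuery;

  public class StructuredWordCount {
    public static void main(String[] args) throws Exception {
      SparkSession spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate();
      // Every line arriving on the socket becomes a new row of the unbounded input table.
      Dataset<Row> lines = spark.readStream().format("socket")
          .option("host", "localhost").option("port", 9999).load();
      Dataset<String> words = lines.as(Encoders.STRING())
          .flatMap((FlatMapFunction<String, String>) l -> Arrays.asList(l.split(" ")).iterator(),
                   Encoders.STRING());
      Dataset<Row> counts = words.groupBy("value").count();
      // complete mode: the whole updated result table is emitted on every trigger.
      StreamingQuery query = counts.writeStream().outputMode("complete").format("console").start();
      query.awaitTermination();
    }
  }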
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 168
Overview of Spark Streaming

Spark Streaming is an extension of the Spark core API. It is a real-time computing framework featuring scalability, high throughput, and fault tolerance.

(Figure: Spark Streaming reads from sources such as HDFS and Kafka and writes results to HDFS, Kafka, or databases.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 169
Mini Batch Processing of Spark Streaming

Spark Streaming programming is based on DStream, which decomposes streaming programming into a series of short batch jobs.

(Figure: the input data stream is divided into batches of input data; the Spark Streaming engine processes each batch and produces batches of processed data.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 170
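A minimal DStream sketch of this mini-batch model (host, port, and the 1-second batch interval are illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaPairDStream;
  import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;
  import scala.Tuple2;

  public class DStreamWordCount {
    public static void main(String[] args) throws Exception {
      // Each 1-second interval of the stream becomes one short batch job.
      JavaStreamingContext jssc =
          new JavaStreamingContext(new SparkConf().setAppName("DStreamWordCount"), Durations.seconds(1));
      JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
      JavaPairDStream<String, Integer> counts = lines
          .flatMap(l -> Arrays.asList(l.split(" ")).iterator())
          .mapToPair(w -> new Tuple2<>(w, 1))
          .reduceByKey((a, b) -> a + b);
      counts.print();
      jssc.start();
      jssc.awaitTermination();
    }
  }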
Fault Tolerance Mechanism of Spark Streaming

Spark Streaming performs computing based on RDDs. If some partitions of an RDD are lost, they can be recovered using the RDD lineage mechanism.

(Figure: the RDDs of interval [0,1) and interval [1,2) are derived through map and reduce operations, so lost partitions can be recomputed from their lineage.)


Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 171
CONTENTS
01 02 03
Spark Overview Spark Principles and Spark Integration
Architecture in FusionInsight HD
• Spark Core
• Spark SQL and Dataset
• Spark Structured Streaming
• Spark Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 172
Spark WebUI

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 173
Spark and Other Components

In the FusionInsight cluster, Spark interacts with the following components:

• HDFS: Spark reads or writes data in the HDFS (mandatory).


• YARN: YARN schedules and manages resources to support the running of
Spark tasks (mandatory).
• Hive: Spark SQL shares metadata and data files with Hive (mandatory).
• ZooKeeper: HA of JDBCServer depends on the coordination of
ZooKeeper (mandatory).
• Kafka: Spark can receive data streams sent by Kafka (optional).
• HBase: Spark can perform operations on HBase tables (optional).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 174
Summary
• The background, application scenarios, and characteristics of Spark are briefly introduced.
• Basic concepts, technical architecture, task running processes, Spark on YARN, and application scheduling
of Spark are introduced.
• Integration of Spark in FusionInsight HD is introduced.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 175
Quiz

• What are the characteristics of Spark?


• What are the advantages of Spark in comparison with MapReduce?

• What are the differences between wide dependencies and narrow dependencies of
Spark?
• What are the application scenarios of Spark?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 176
Quiz

• RDD operators are classified into: _________ and _________.

• The ___________ module is the core module of Spark.

• RDD dependency types include ___________ and ___________.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 177
More Information

• Download training materials:


– http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094&courseId=Node1000011807

• eLearning course:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 178
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives

Upon completion of this course, you will be able to know:
A. System architecture of HBase
B. Key features of HBase
C. Basic functions of HBase
D. Huawei enhanced features of HBase
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 181
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 182
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 183
HBase Overview

HBase is a column-based distributed storage system that


features high reliability, performance, and scalability.

• HBase is suitable for storing big table data (which contains billions of rows and millions of columns) and allows real-time
data access.
• HBase uses HDFS as the file storage system to provide a distributed column-oriented database system that allows real-
time data reading and writing.
• HBase uses ZooKeeper as the collaboration service.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 184
HBase vs. RDB

• Distributed storage and column-oriented.


• Dynamic extension of columns.
HBase • Supports common commercial hardware, lowering
the expansion cost.

• Fixed data structure.


• Pre-defined data structure. RDB
• I/O intensive and cost-consuming expansion.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 185
Data Stored By Row

ID Name Phone Address

Data is stored by row in an underlying file system. Generally, a fixed amount of


space is allocated to each row.
• Advantages: Data can be added, modified, or read by row.
• Disadvantages: Some unnecessary data is obtained when data in a column is queried.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 186
Data Stored by Column

ID Name Phone Address

Data is stored by column in an underlying file system.


• Advantages: Data can be read or calculated by column.
• Disadvantages: When a row is read, multiple I/O operations may be required.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 187
Application Scenarios of HBase

HBase applies to the following scenarios:

• Massive data (TB and PB).


• The Atomicity, Consistency, Isolation, Durability (ACID) feature
supported by traditional relational databases is not required.
• High throughput.
• Efficient random reading of massive data.
• High scalability.
• Simultaneous processing of structured and unstructured data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 189
Position of HBase in FusionInsight
(Figure: position of HBase in FusionInsight. In the Hadoop layer, HDFS/HBase sit at the bottom, with YARN/ZooKeeper above them and Hive, M/R, Spark, Storm, and Flink on top; the DataFarm layer (Porter, Miner, Farmer) and the application service layer sit above, and FusionInsight Manager provides system management, service governance, and security management.)

HBase is a column-based distributed storage system that features high reliability,


performance, and scalability. It stores massive data and is designed to eliminate
limitations of relational databases in the processing of mass data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 190
KeyValue Storage Model (1)

ID Name Phone Address

Key-01 Value-ID01 Key-01 Value-Name01

Key-01 Value-Phone01 Key-01 Value-Address01

• KeyValue has a specific structure. Key is used to quickly query a data record, and Value is used to store user data.

• As a basic user data storage unit, KeyValue must store some description of itself, such as timestamp and type information. This
requires some structured space.

• Data can be expanded dynamically, adapting to changes of data types and structures. Data is read and written by block. Different columns are not associated with each other, and neither are different tables.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 191
KeyValue Storage Model (2)
Partition mode of a KeyValue database: partitioning is based on continuous key ranges.

Region_01 Region_02 Region_05 Region_06 Region_09 Region_10

Region_03 Region_04 Region_07 Region_08 Region_11 Region_12

Node1 Node2 Node3

Region_01 Region_05 Region_02 Region_06 Region_03 Region_07

Region_09 Region_04 Region_10 Region_12 Region_11 Region_08

• Data subregions are created based on the RowKey range (sorting based on a sorting algorithm such as the alphabetic order
based on RowKeys). Each subregion is a basic distributed storage unit.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 192
KeyValue Storage Model (3)

• The underlying data of HBase exists in the form of KeyValue. KeyValue has a specific format.
• KeyValue contains key information such as timestamp and type, etc.
• The same key can be associated with multiple Values. Each KeyValue has a qualifier.
• There can be multiple KeyValues associated with the same Key and Qualifier. In this case, they are distinguished
using timestamps. This is why there are multiple versions of the same data record.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 193
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 194
HBase Architecture (1)

(Figure: HBase architecture. The Client talks to ZooKeeper and the HMaster; HRegionServers host HRegions, each consisting of Stores (a MemStore plus StoreFiles stored as HFiles), with the Regions of a RegionServer sharing an HLog; data is persisted through DFS Clients to HDFS DataNodes.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 195
HBase Architecture (2)

• Store: a Region consists of one or multiple Stores. Each Store corresponds to a Column Family.
• MemStore: a Store contains one MemStore. Data inserted into a Region by the client is cached in the MemStore.
• StoreFile: the data flushed to HDFS is stored as a StoreFile in HDFS.
• HFile: HFile defines the storage format of StoreFiles in the file system. HFile is the underlying implementation of StoreFile.
• HLog: HLogs prevent data loss when a RegionServer is faulty. Multiple Regions in a RegionServer share the same HLog.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 196
HMaster (1)
"Hey, Region A, please move to RegionServer 1!"
"RegionServer 2 was gone! Let others take over
it's Regions!"

RegionServer1 RegionServer1 RegionServer1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 197
HMaster (2)

The HMaster process manages all the RegionServers:
• Handles RegionServer failovers.
• Performs cluster operations including creating, modifying, and deleting tables.

The HMaster process migrates Regions:
• Allocates Regions when a new table is created.
• Ensures load balancing during operation.
• Takes over Regions after a RegionServer failover occurs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 198
RegionServer

• RegionServer is the data service process of HBase and is responsible for processing reading and writing requests of user data.
• RegionServer manages Regions. All reading and writing requests of user data are handled based on interaction among Regions on RegionServers.
• Regions can be migrated between RegionServers.

(Figure: a RegionServer hosting multiple Regions.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 199
Region (1)

• A data table is divided horizontally into subtables based on the KeyValue range to implement distributed storage. A subtable is called a Region in HBase.
• Each Region is associated with a KeyValue range, which is described using a StartKey and an EndKey. Each Region only needs to record a StartKey, because its EndKey serves as the StartKey of the next Region.
• Region is the most basic distributed storage unit of HBase.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 200
Region (2)

(Figure: rows Row001-Row010 map to Region-1, Row011-Row020 to Region-2, Row021-Row030 to Region-3, and Row031 onward to Region-4; each Region is described by its StartKey and EndKey.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 201
Region (3)

META Region

Region Region Region Region Region

• Regions are categorized as Meta Region and User Region.


• Meta Region records routing information of User Regions.
• Perform the following steps to access data in a Region:
Search for the address of the Meta Region.

Search for the address of the User Regions in the Meta Region.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 202
Column Family

Region Region Region Region


/HBase/table
/region-1/ColumnFamily-1
/region-1/ColumnFamily-2

/region-2/ColumnFamily-1
/region-2/ColumnFamily-2
/HBase/table
/region-1
/region-3/ColumnFamily-1
/region-2
/region-3/ColumnFamily-2
/region-3
HDFS

• A ColumnFamily is a physical storage unit of a Region. Multiple column families of the same Region have different paths in HDFS.
• ColumnFamily information is table-level configuration information. That is, multiple Regions of the same table have the same column family information (for example, each Region has two column families, and the configuration information of the same column family is identical across Regions).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 203
ZooKeeper

ZooKeeper provides the following functions for HBase:


1. Distributed lock service
• Multiple HMaster processes try to register a node in ZooKeeper, and the node can be registered by only one HMaster process. The process that successfully registers the node becomes the active HMaster process.

2. Event listening mechanism
• The active HMaster's record is deleted after the active process fails, and the standby processes receive an update message indicating that the active HMaster is down.

3. Micro database role
• ZooKeeper stores the addresses of RegionServers. In this case, ZooKeeper can be regarded as a micro database.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 204
MetaData Table
• The MetaData table HBase:meta stores the information about Regions so that the client can locate the specific Region.
• The MetaData table is divided into multiple Regions, and the metadata information of these Regions is stored in ZooKeeper.

(Figure: the MetaData table records the mapping relation between user tables 1..N and their Regions.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 205
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 206
Writing Process
(Figure: a RegionServer is like a library floor holding Region 1 General Biology, Region 2 Environmental Biology, Region 3 Palaeontology, Region 4 Physiochemistry, and Region 5 Biophysics. Books are shelved by classification: for example, Physiochemistry books (Rowkey 001, 002, ...) go to Region 4 and Palaeontology books to Region 3.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 207
Client Initiating a Data Writing Request

Client

• The process of initiating a writing request by a client is like sending books to a library
by a book supplier. The book supplier must determine to which building and floor the
books should be sent.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 208
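In code, the writing request is issued through the HBase client API. A minimal Java sketch (the table, column family, qualifier, and values are illustrative, not from this course):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBasePutDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("books"))) {
        // One Put per RowKey; the client locates the target Region through the META table.
        Put put = new Put(Bytes.toBytes("Rowkey001"));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("title"), Bytes.toBytes("Palaeontology"));
        table.put(put);
      }
    }
  }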
Writing Process - Locating a Region
1 Hi, META. Please send the bookshelf number, book number scope
(Rowkeys included in each Region), and information about the
floors where the bookshelves are located (RegionServers to which
the Regions belong) to me.

(Figure: the META table maps each Region, the bookshelf number, to its Rowkey scope (the book number range, e.g. Rowkey 070-078 for Region 3 Palaeontology) and to the RegionServer that hosts it, the floor information.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 209
Writing Process - Grouping Data (1)

I have classified the books by


book number and I am going to
send the books to RegionServers.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 210
Writing Process - Grouping Data (2)

Data grouping includes two steps:
• Find the Region and RegionServer information of the table based on the META table.
• Dispatch each record to its specific Region according to its RowKey.

Data destined for each RegionServer is sent at the same time; at this point, the data has already been divided by Region.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 211
Writing Process - Sending a Request to a RegionServer

RS 1 / RS 2 / RS 5, I am • Data is sent using the encapsulated RPC framework of HBase.


sending the books to you.

• Operations of sending requests to multiple RegionServers are implemented concurrently.


RegionServer

• After sending a data writing request, a client waits for the request processing result.

RegionServer • If the client does not capture any exception, it deems that all data has been written successfully. If
writing the data fails completely or partially, the client can obtain a detailed KeyValue list relevant to
the failure.

RegionServer

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 212
Writing Process - Process of Writing Data to a Region
(Figure: RegionServers RS1, RS2, and RS5 each store the received KeyValues (Q1~Q3, Q4~Q5, Q6~Q7, Q8~Q10, Q11~Q12) in sequence within their Regions, like shelving books in order by book number.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 213
Writing Process - Flush

(Figure: within a Region, ColumnFamily-1 and ColumnFamily-2 each have a MemStore that is flushed to HFiles.)

A Flush operation of a MemStore is triggered in any of the following scenarios:
• The total MemStore usage of a Region reaches the predefined flush size threshold.
• The ratio of occupied memory to total memory of the RegionServer reaches the threshold.
• The number of WALs reaches the threshold.
• The MemStore is flushed periodically (every 1 hour by default).
• Users can flush a table or Region separately with a shell command (or programmatically, as sketched after this slide).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 214
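Besides the shell, a flush can also be triggered programmatically through the HBase Admin API; a minimal sketch (the table name is illustrative):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;

  public class FlushDemo {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Admin admin = conn.getAdmin()) {
        admin.flush(TableName.valueOf("books")); // flush all MemStores of the table to HFiles
      }
    }
  }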
Impacts of Multiple HFiles

(Figure: read latency (ms) versus load test time (sec). Latency rises steadily over a 4-hour load test as HFiles accumulate.)

As time passes by, the number of HFiles increases and a query request will take much more
time.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 215
Compaction (1)

Compaction aims to reduce the number of small files in a column family in a Region,
thereby increasing reading performance.

There are two kinds of compaction: major and minor.

• Minor: compaction covering a small range. Minimum and • Major: compaction covering the HFiles in a column family in a
maximum numbers of files are specified. Small files at a Region. During major compaction, deleted data is cleared.
consecutive time duration are combined.

Files are selected based on a certain algorithm during minor compaction.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 216
Compaction (2)

Write
put MemStore

Flush

HFile HFile HFile HFile HFile HFile HFile

Minor Compaction

HFile HFile HFile

Major Compaction

HFile

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 217
Region Split

Parent Region
• A common Region splitting operation is performed to split a Region into
two subregions if the data size of the Region exceeds the predefined
threshold.

• During splitting, the split Region suspends the reading and writing
services. During splitting, data files of the parent Region are not split
and rewritten to the two subregions. Reference files are created in the
new Region to achieve quick splitting. Therefore, services of the Region
are suspended only for a short time.

• Routing information of the parent Region cached in clients must be


updated.
DaughterRegion-1 DaughterRegion-2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 218
Reading Process
(Figure: the same library layout as in the writing process, with Regions 1-5 for the five book categories on a RegionServer floor. A read for Physiochemistry (Rowkey 001, 002, ...) goes to Region 4, and one for Palaeontology goes to Region 3.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 219
Client Initiating a Data Reading Request

Get

 When a precise key is provided, the


Get operation is performed to read a
single row of user data.

Scan

 The Scan operation is to batch scan


Client
user data of a specified Key range.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 220
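A minimal Java sketch of the two read paths (table and key names are illustrative; withStartRow/withStopRow follow the HBase 2.x-style client API):

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseReadDemo {
    public static void main(String[] args) throws Exception {
      try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
           Table table = conn.getTable(TableName.valueOf("books"))) {
        // Get: a precise key reads a single row.
        Result row = table.get(new Get(Bytes.toBytes("Rowkey001")));
        // Scan: batch-reads a key range [startRow, stopRow).
        Scan scan = new Scan().withStartRow(Bytes.toBytes("Rowkey001"))
                              .withStopRow(Bytes.toBytes("Rowkey100"));
        try (ResultScanner rs = table.getScanner(scan)) {
          for (Result r : rs) {
            // process each row here
          }
        }
      }
    }
  }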
Locating a Region
1. "Hi, META, I want to look for books whose codes range from xxx to xxx. Please find the bookshelf numbers and the floor information within that code range."

(Figure: the META table returns the Region (bookshelf number), its Rowkey scope (book number range), and the RegionServer (floor information).)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 221
OpenScanner

(Figure: a Region with ColumnFamily-1 (MemStore, HFile-11, HFile-12) and ColumnFamily-2 (MemStore, HFile-21, HFile-22).)

During the OpenScanner process, scanners corresponding to the MemStore and each HFile are created:
• The scanner corresponding to an HFile is a StoreFileScanner.
• The scanner corresponding to a MemStore is a MemStoreScanner.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 222
Filter

Filter allows users to set filtering criteria during the Scan operation. Only user data that meets the criteria is returned.

Some typical Filter types are:
• RowFilter
• SingleColumnValueFilter
• KeyOnlyFilter
• FilterList

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 223
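A sketch of attaching a filter to a Scan (HBase 2.x-style API; the family, qualifier, and value are illustrative):

  import org.apache.hadoop.hbase.CompareOperator;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FilterDemo {
    public static Scan palaeontologyScan() {
      Scan scan = new Scan();
      // Return only rows whose info:type column equals "Palaeontology".
      scan.setFilter(new SingleColumnValueFilter(
          Bytes.toBytes("info"), Bytes.toBytes("type"),
          CompareOperator.EQUAL, Bytes.toBytes("Palaeontology")));
      return scan;
    }
  }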
BloomFilter

BloomFilter is used to optimize scenarios where data is randomly read, that is,
scenarios where the Get operation is performed. It can be used to quickly check
whether a piece of user data exists in a large dataset (most data in the dataset
cannot be loaded to the memory).

A certain error rate exists when BloomFilter checks whether a piece of data exists. Nevertheless, the conclusion indicated by the message "User data XXXX does not exist" is accurate.

The data relevant to the BloomFilter of HBase is stored in HFiles.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 224
CONTENTS
01 02 03 04
Introduction to Functions and Key Processes of Huawei Enhanced Features
HBase Architecture of HBase HBase of HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 225
Supporting Secondary Index

• The secondary index enables HBase to query data based on specific column values.

Column Family A Column Family B


RowKey A: Name A: Addr A: Age B: Mobile B: Email

01 ZhangSan Beijing 23 6875349 ……

02 LiLei Hangzhou 43 6831475 ……

03 WangWu Shenzhen 35 6809568 ……

04 …… Wuhan 28 6812645 ……

05 …… Changsha 26 6889763 ……

06 …… Jinan 35 6854912 ……

• When the secondary index is not used, the mobile field needs to be matched in the entire table by row to search for
specified mobile numbers such as “68XXX” which results in long time delay.
• When the secondary index is used, the index table is searched first to identify the location of the mobile number,
which narrows down the search scope and reduces the time delay.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 226
HFS

• HBase FileStream (HFS) is a separate module of HBase. As an encapsulation of HBase and HDFS interfaces, HFS
provides capabilities, such as storing, reading and deleting files for upper-level applications.

• HFS provides the ability of storing massive small files and large files in HDFS. That is, massive small files (less than
10MB) and some large files (larger than 10MB) can be stored in HBase.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 227
HBase MOB (1)

MOB data (100 KB to 10 MB) is directly stored in the file system (HDFS, for example) as HFiles. The information about the address and size of each file is stored in HBase as a value. With tools managing these files, the frequency of compaction and split can be greatly reduced, and performance is improved.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 228
HBase MOB (2)

(Figure: HBase MOB architecture. HRegionServers contain MOB Stores alongside normal Stores; StoreFiles and MOB files are both stored as HFiles in HDFS.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 229
Summary
This module describes the following information about HBase: KeyValue Storage Model, technical architecture,
reading and writing process and enhanced features of FusionInsight HBase.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 230
Quiz

• Can the services of the Region in HBase be provided when splitting?


• What are the advantages of the Region splitting of HBase?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 231
Quiz

• What is Compaction used for? ( )


A. Reducing the number of files in a column family and Region.
B. Improving data reading performance.
C. Reducing the number of files in a column family.
D. Reducing the number of files in a Region.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 232
Quiz

• What is the physical storage unit of HBase? ( )


A. Region.
B. Column Family.
C. Column.
D. Cell.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 233
More Information

• Download training materials:


– http://support.huawei.com/learning/trainFaceDetailAction?lang=en&pbiPath=term1000121094&courseId=Node1000011807

• eLearning course:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi?navId=_40&lang=en

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 234
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Hive

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Foreword

Based on Hive provided by the Hive open source community, Hive in FusionInsight HD has various enterprise-level customization features, such as Colocation table creation, column encryption, and syntax enhancement. With these features, FusionInsight HD outperforms the community version in terms of reliability, fault tolerance, scalability, and performance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 237
Objectives

Upon completion of this course, you will be able to know:
A. Hive application scenarios and basic principles
B. Enhanced features of FusionInsight Hive
C. Common Hive SQL statements
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 238
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 239
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 240
Hive Overview

Hive is a data warehouse tool running on Hadoop and supports PB-


level distributed data query and management.

Hive provides the following functions:

• Flexible ETL (extract / transform / load).

• Supporting computing engines, such as MapReduce, Tez, and Spark.

• Direct access to HDFS files and HBase.

• Easy to use and program.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 241
Application Scenarios of Hive

• Data aggregation: daily/weekly click counts, traffic statistics.
• Data mining: interest analysis, user behavior analysis, partition demonstration.
• Non-real-time analysis: log analysis, text analysis.
• Data warehouse: data extraction, data loading, data transformation.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 243
Position of Hive in FusionInsight
(Figure: position of Hive in FusionInsight. The same layered architecture as for HBase: Hive sits in the Hadoop layer on top of YARN/ZooKeeper and HDFS/HBase, below the DataFarm and application service layers, with FusionInsight Manager providing system management, service governance, and security management.)
Hive is a data warehouse tool, which employs HiveQL (SQL-like) to query


data. All Hive data is stored in HDFS.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 244
Comparison Between Hive and Traditional Data
Warehouses (1)
Storage — Hive: HDFS; theoretically, it is infinitely scalable. Traditional warehouse: a cluster of limited storage capacity whose calculation speed drops dramatically as storage grows; applicable only to commercial applications with small data volumes, not to extra-large amounts of data.

Execution engine — Hive: MapReduce/Tez/Spark. Traditional warehouse: can use more efficient algorithms and more optimization measures to query data.

Usage — Hive: HQL (similar to SQL). Traditional warehouse: SQL.

Flexibility — Hive: metadata and data are stored separately for decoupling. Traditional warehouse: low; data is used for limited purposes.

Analysis speed — Hive: depends on cluster size and is scalable; more efficient than traditional data warehouses for large amounts of data. Traditional warehouse: fast for small amounts of data, but the speed decreases dramatically as the data volume grows.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 245
Comparison Between Hive and Traditional Data
Warehouses (2)
Index — Hive: low efficiency; it has not met expectations so far. Traditional warehouse: efficient.

Usability — Hive: an application model must be developed, which brings high flexibility but low usability. Traditional warehouse: provides a set of well-developed report solutions to facilitate data analysis.

Reliability — Hive: data is stored in HDFS, which features high reliability and high fault tolerance. Traditional warehouse: relatively low reliability; a failed query must be restarted, and data fault tolerance relies on hardware RAID.

Environment dependence — Hive: can be deployed on common computers. Traditional warehouse: requires high-performance commercial servers.

Price — Hive: open-source product. Traditional warehouse: commercial data warehouses are expensive.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 246
Advantages of Hive

Advantages of Hive:
1. High reliability and tolerance: HiveServer in cluster mode, dual MetaStore, query retry after timeout.
2. SQL-like query: SQL-like syntax and a large number of built-in functions.
3. Scalability: user-defined storage formats and user-defined functions (UDF/UDAF/UDTF).
4. Multiple interfaces: Beeline, JDBC, Thrift, Python, ODBC.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 247
Disadvantages of Hive

Disadvantages of Hive:
1. High latency: MapReduce is the execution engine by default, and MapReduce latency is high.
2. No materialized views: materialized views are not supported, and data updating, insertion, and deletion cannot be performed on views.
3. Inapplicable to OLTP: column-level data adding, updating, and deletion are not supported.
4. No stored procedures: stored procedures are not supported, but logic processing can be implemented using UDFs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 248
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 249
Hive Architecture

[Figure: Hive architecture. JDBC and ODBC clients, the Web Interface, the Command Line Interface, and the Thrift Server all reach the Driver (Compiler, Optimizer, Executor), which works with the MetaStore.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 250
Hive Architecture in FusionInsight HD

Hive consists of HiveServer, MetaStore, and WebHCat, each of which can have multiple instances:
• HiveServer: receives requests from clients, parses and executes HQL commands, and returns query results.
• MetaStore: provides metadata services.
• WebHCat: provides HTTP services, such as metadata access and Data Definition Language (DDL) operations, for external users.
Hive depends on DBService, HDFS, and YARN.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 251
Architecture of WebHCat

WebHCat provides REST interfaces that allow users to perform the following operations over the secure HTTPS protocol:
• Hive DDL operations
• Running Hive HQL tasks
• Running MapReduce tasks

[Figure: the WebHCat Server (also known as Templeton) receives REST requests and dispatches them; DDL operations go to the HCat DDL server, while Pig, Hive, and MapReduce tasks are queued for execution and return a Job_ID, with results exchanged through HDFS.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 252
Data Storage Model of Hive

[Figure: Hive data storage model. A Database contains Tables; a Table can be divided into Partitions or directly into Buckets; table data can be skewed or normal; Buckets hold the actual data.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 253
Data Storage Model of Hive - Partition and
Bucket
Partition: A data table can be divided into partitions
by using a field value.
• Each partition is a directory.
• The number of partitions is configurable.
• A partition can be further partitioned or bucketed.

Bucket: Data can be stored in different buckets.


• Each bucket is a file.
• The bucket quantity is set when a table is created and data can be sorted in the
bucket.
• Data is stored in a bucket by the hash value of a field.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 254
Data Storage Model of Hive - Managed Table and
External Table
Hive can create managed tables and external tables:
• Managed tables are created by default and managed by Hive. In this case, Hive migrates data into its data warehouse directories.
• When external tables are created, Hive accesses data in locations outside the data warehouse directories.
• Use managed tables when Hive performs all operations on the data.
• Use external tables when Hive and other tools share the same data set for different processing.

Managed Table External Table


CREATE / Data is migrated to data The location of external data is specified
LOAD warehouse directories. when a table is created.

DROP Metadata and data are deleted. Only metadata is deleted.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 256
Functions of Hive

Built-in functions in Hive:


• Mathematical functions, such as round(), floor(), abs(), rand(), etc.
• Date functions, such as to_date(), month(), day(), etc.
• String functions, such as trim(), length(), substr(), etc.

UDF (User-Defined Function)
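For illustration, the sketch below shows a minimal custom UDF written in Java against the classic Hive UDF API; the class name and behavior are hypothetical examples, not part of the training material.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF that upper-cases a string column.
// Hive calls evaluate() once per input row.
public final class ToUpperUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;            // null in, null out
        }
        return new Text(input.toString().toUpperCase());
    }
}

After the class is packaged into a JAR and added with ADD JAR, it can be registered with CREATE TEMPORARY FUNCTION (using the fully qualified class name) and then called in HQL like a built-in function.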

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 258
Enhanced Features of Hive - Colocation
Overview
• Colocation: storing associated data or data on which associated operations are performed on
the same storage node.
• File-level Colocation allows quick file access. This avoids network consumption caused by
data migration.

[Figure: with one NameNode and six DataNodes, the blocks of associated files A, B, C, and D are co-located on the same DataNodes.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 259
Enhanced Features of Hive - Using Colocation

Step 1: Use an HDFS interface to create a group ID and locator IDs.

hdfs colocationadmin -createGroup -groupId groupid -locatorIds locatorid1,locatorid2,locatorid3

Step 2: Use the Hive Colocation function when creating tables.

CREATE TABLE tbl_1 (id INT, name STRING) STORED AS RCFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");

CREATE TABLE tbl_2 (id INT, name STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
TBLPROPERTIES("groupId"="group1","locatorId"="locator1");

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 260
Enhanced Features of Hive - Encrypting
Columns
Step 1: When creating a table, specify the columns to be encrypted and the encryption algorithm.

create table encode_test (id INT, name STRING, phone STRING, address STRING)
row format serde "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES(
"column.encode.columns"="phone,address",
"column.encode.classname"="org.apache.hadoop.hive.serde2.AESRewriter");

Step 2: Use the insert syntax to import data into the table whose columns are encrypted.

insert into table encode_test select id, name, phone, address from test;

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 261
Enhanced Features of Hive - Deleting HBase
Records in Batches

Overview:
• In FusionInsight HD, Hive allows deletion of a single record from an HBase table. Hive can use specific syntax to delete one or more data
records that meet criteria from its HBase tables.

Usage:
• To delete some data from an HBase table, run the following HQL statement:

remove table HBase_table where expression;

Here, expression indicates the criteria for selecting the data to be deleted.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 262
Enhanced Features of Hive - Controlling Traffic

By using the traffic control feature, you can control:


• Total number of established connections
• Number of established connections of each user
• Number of connections established within a unit period

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 263
Enhanced Features of Hive -
Specifying Row Delimiters
Step 1: Set inputFormat and outputFormat when creating a table.

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format]
STORED AS
inputformat "org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat"
outputformat "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

Step 2: Specify the delimiter before a query.

set hive.textinput.record.delimiter="!@!";

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 264
CONTENTS
01 02 03
Introduction to Hive Hive Functions and Basic Hive
Architecture Operations

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 265
Hive SQL Overview

DDL-Data definition language


• Table creation, table modification and deletion, partitions, and data types.

010101010101
010101010101 DML-Data manipulation language
010101010101 • Data import, export.

DQL-Data query language


• General query.
• Complicated query, like Group by, Order by, Join, etc.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 266
Hive Basic Operations (1)
Data format example: 1,huawei,1000.0
• Create managed table.

CREATE TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

• Create external table.


CREATE EXTERNAL TABLE IF NOT EXISTS example.employee(
Id INT COMMENT 'employeeid',
Company STRING COMMENT 'your company',
Money FLOAT COMMENT 'work money')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/localtest';

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 267
Hive Basic Operations (2)
• Modify a column.
ALTER TABLE employee1 CHANGE money money STRING COMMENT 'changed by alter' AFTER dateincompany;

• Add a column.
ALTER TABLE employee1 ADD columns(column1 string);

• Modify the file format.


ALTER TABLE employee3 SET fileformat TEXTFILE;

• Delete table data.


DELETE FROM table_1 WHERE column_1=??;
DROP TABLE table_a;

• Describe table.
DESC table_a;

• Show the statements for creating a table.


SHOW CREATE TABLE table_a;

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 268
Hive Basic Operations (3)
• Load data from the local.
LOAD DATA LOCAL INPATH 'employee.txt' OVERWRITE INTO TABLE
example.employee;

• Load data from another table.


INSERT INTO TABLE company.person PARTITION(century=
'21',year='2010')
SELECT id, name, age, birthday FROM company.person_tmp WHERE
century= '23' AND year='2010';
• Export data from a Hive table to HDFS.
EXPORT TABLE company.person TO '/department';

• Import data from HDFS to a Hive table.


IMPORT TABLE company.person FROM '/department';

• Insert data.
INSERT INTO TABLE company.person
SELECT id, name, age, birthday FROM company.person_tmp
WHERE century= '23' AND year='2010';

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 269
Hive Basic Operations (4)
• WHERE.
SELECT id, name FROM employee WHERE salary >= 10000;

• GROUP BY.
SELECT department, avg(salary) FROM employee GROUP BY department;

• UNION ALL.
SELECT id, salary, date FROM employee_a UNION ALL
SELECT id, salary, date FROM employee_b;

• JOIN.
SELECT a.salary, b.address FROM employee a JOIN employee_info
b ON a.name=b.name;

• Subquery.
SELECT a.salary, b.address FROM employee a JOIN (SELECT
address FROM employee_info where province='zhejiang') b ON
a.name=b.name;

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 270
Summary
This module describes the following information about Hive: basic principles, application scenarios, enhanced
features in FusionInsight and common Hive SQL statements.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 271
Quiz

• Which of the following scenarios does Hive apply to?


A. Real-time online data analysis.
B. Data mining (user behavior analysis, interest analysis, and partition demonstration).
C. Data aggregation (daily / weekly click count and click count rankings).
D. Non-real-time data analysis (log analysis and statistics analysis).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 272
Quiz

• Which of the following statements about Hive SQL operations are correct?

A. The keyword external is used to create an external table and the key word internal is used to create a common
table.
B. Specify the location information when creating an external table.
C. When data is uploaded to Hive, the data source must be one HDFS path.
D. When creating a table, column delimiters can be specified.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 273
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 274
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Streaming

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Real-time stream processing
A
System architecture of Streaming
B
Objectives
Upon completion of this course, you will be able
to know:
Key features of Streaming
C
Basic CQL concepts
D
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 277
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 278
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 279
Streaming Overview

Streaming is a distributed real-time computing framework based on open-source Storm, with the following features:
• Real-time response with low delay
• Data computing before storing
• Continuous query
• Event-driven

[Figure: event data from sources such as YouTube, Facebook, WeChat, and Weibo is processed with no waiting; results are delivered in-flight, triggering alerts and actions, and continuous queries run before data reaches storage.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 280
Application Scenarios of Streaming

Streaming is applicable to the following scenarios:
• Real-time analysis: real-time log processing and vehicle traffic analysis.
• Real-time statistics: real-time website access statistics and sorting.
• Real-time recommendation: real-time advertisement positioning and event marketing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 281
Position of Streaming in FusionInsight
Application service layer

OpenAPI / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Porter Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive M/R Spark Streaming Flink governance

Hadoop YARN / ZooKeeper LibrA


Security
management
HDFS / HBase

Streaming is a distributed real-time computing framework, widely used in


real-time business.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 282
Comparison with Spark Streaming
[Figure: Spark Streaming splits a live input stream into micro-batches (t1 to tn), generates RDDs, and starts Spark batch jobs through the task scheduler and memory manager to execute RDD transformations; Streaming processes each event as it arrives.]

Micro-batch processing by Spark Streaming versus stream processing by Streaming:
• Task execution mode: Spark Streaming starts execution logic instantly and reclaims it upon completion; Streaming starts execution logic before execution and keeps it persistent.
• Event processing mode: Spark Streaming starts processing upon accumulation of a certain number of events in a batch; Streaming processes events in real time.
• Delay: Spark Streaming is second-level; Streaming is millisecond-level.
• Throughput: Spark Streaming is high (2 to 5 times that of Streaming); Streaming is average.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 283
Comparison of Application Scenario

[Figure: on a real-time performance axis, Streaming responds in milliseconds while Spark Streaming responds in seconds.]
• Streaming applies to delay-sensitive services.
• Spark Streaming applies to delay-insensitive services.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 284
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 285
Basic Concepts (1)

Topology
A real-time application in Streaming.
Nimbus
Assigns resources and schedules tasks.
Supervisor
Receives tasks assigned by Nimbus, and starts/stops Worker processes.

Worker
Runs component logic processes.
Spout
Generates source data flows in a topology.
Bolt
Receives and processes data in a topology.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 286
Basic Concepts (2)

Task
A Spout or Bolt thread running in a Worker.

Tuple
The core data structure of Streaming and the basic message delivery unit, organized as key-value pairs; tuples can be created and processed in a distributed way.

Stream
An infinite continuous Tuple sequence.

ZooKeeper
Provides distributed collaboration services for processes. Active / Standby Nimbus,
Supervisor, and Worker register their information in ZooKeeper. This enables Nimbus to
detect the health status of all roles.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 288
System Architecture

[Figure: the Client submits a topology to Nimbus. Nimbus monitors heartbeats and assigns tasks through the ZooKeeper cluster. Each Supervisor downloads the JAR package, obtains its tasks from ZooKeeper, and starts Workers. Workers run Executors and report heartbeats.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 289
Topology
• A topology is a directed acyclic graph (DAG) consisting of Spout (data source) and Bolt (for logical
processing). Spout and Bolt are connected through Stream Groupings.
• Service processing logic is encapsulated in topologies in Streaming.

[Figure: a Spout obtains stream data from external data sources and feeds Bolts A and B; Bolt A filters data, and Bolt C triggers external messages and performs persistent archiving.]
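The sketch below shows how such a topology could be wired together with the Java API; the spout and bolt classes (SentenceSpout, SplitBolt, CountBolt) are hypothetical placeholders for user-defined components.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Spout: generates the source data flow of the topology.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // Bolts: receive and process data; the grouping defines how
        // tuples are delivered from the upstream component.
        builder.setBolt("split", new SplitBolt(), 4)
               .shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(3);   // number of Worker processes (JVMs)
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}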

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 290
Worker

Worker
A Worker is a JVM process, and a topology runs in one or more Workers. A started Worker runs continuously unless manually stopped. The number of Worker processes depends on the topology setting and has no upper limit; the number of Worker processes that can actually be scheduled and started depends on the number of slots configured in the Supervisors.

Executor
A Worker process runs one or more Executor threads. Each Executor can run one or more task instances of either a Spout or a Bolt.

Task
A unit that processes data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 291
Task

Both Spouts and Bolts in a topology support concurrent running. In the topology, you can specify the number of concurrently running tasks for each node. Streaming assigns the tasks across the cluster to enable simultaneous computation and enhance the processing capability of the system.

[Figure: a Spout delivers tuples to Bolts A, B, and C through stream groupings, each component running multiple parallel tasks.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 292
Message Delivery Policies

• fieldsGrouping (field grouping): delivers messages in groups to tasks of the target Bolt according to message hash values.
• globalGrouping (global grouping): delivers all messages to one fixed task of the target Bolt.
• shuffleGrouping (shuffle grouping): delivers messages to a random task of the target Bolt.
• localOrShuffleGrouping (local or shuffle grouping): delivers messages randomly to tasks of the target Bolt in the same process if any exist; otherwise delivers messages in shuffle grouping mode.
• allGrouping (broadcast grouping): delivers messages to all tasks of the target Bolt.
• directGrouping (direct grouping): delivers messages to the task of the target Bolt specified by the data producer; the task ID is specified using the emitDirect(taskId, tuple) interface.
• partialKeyGrouping (partial field grouping): load-balanced field grouping.
• noneGrouping (no grouping): same as shuffle grouping.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 293
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 295
Nimbus HA

ZooKeeper cluster

Streaming cluster

Active Nimbus Standby Nimbus

Supervisor Supervisor Supervisor



worker worker worker worker worker worker

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 296
Disaster Recovery

Services are automatically migrated from faulty nodes to normal ones, preventing service interruptions.

[Figure: before failover, topologies Topo1 to Topo4 run across Node1 to Node3; after a node fails, its workers are re-assigned to the remaining nodes with zero manual operation.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 297
Message Reliability
• At Most Once, processing mechanism: none. This mode has the highest throughput and applies to messages with low reliability requirements.
• At Least Once, processing mechanism: Ack. This mode has lower throughput and applies to messages with high reliability requirements, where all data must be completely processed.
• Exactly Once, processing mechanism: Trident, a special transactional API provided by Storm; this mode has the lowest throughput.

When a tuple is completely processed in Streaming, the tuple and all of its derived tuples have been successfully processed. A tuple fails to be processed if the processing is not complete within the timeout period.

[Figure: a tuple tree; root tuple A derives tuples B and C, which in turn derive D and E.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 298
Ack Mechanism

• When a Spout sends a tuple, it notifies the Acker that a new root message has been generated; the Acker creates a tuple tree and initializes its checksum to 0.
• When a Bolt sends a message, it sends an anchor tuple to the Acker to refresh the tuple tree and reports the result after the message is sent. If the message is sent successfully, the Acker refreshes the checksum; if it fails, the Acker immediately notifies the Spout of the failure.
• When a tuple tree is completely processed (checksum = 0), the Acker notifies the Spout of the result.
• The Spout provides ack() and fail() functions to process Acker results; the fail() function implements the message-resending logic, as in the sketch below.

[Figure: a Spout feeds Bolt1 to Bolt4; each hop reports Ack1 to Ack6 back through the Acker.]
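A minimal sketch of these callbacks on a custom spout follows; the message-ID bookkeeping (the pending map) is an illustrative assumption, not a fixed Storm structure.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Values;

public abstract class ReliableSpout extends BaseRichSpout {
    // Tuples awaiting acknowledgement, keyed by message ID (illustrative).
    private final Map<Object, Values> pending = new ConcurrentHashMap<>();
    protected SpoutOutputCollector collector;  // assigned in open() by subclasses

    protected void emitReliably(Object msgId, Values tuple) {
        pending.put(msgId, tuple);
        collector.emit(tuple, msgId);   // the message ID enables Acker tracking
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);          // tuple tree fully processed
    }

    @Override
    public void fail(Object msgId) {
        Values tuple = pending.get(msgId);
        if (tuple != null) {
            collector.emit(tuple, msgId);  // resend on failure or timeout
        }
    }
}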

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 299
Reliability Level Setting

If not every message is required to be processed (some message loss is allowed), the reliability mechanism can be disabled to ensure better performance, in the following ways (see the sketch below):
• Setting Config.TOPOLOGY_ACKERS to 0.
• Using the Spout to send messages through interfaces that do not carry message IDs.
• Using the Bolt to send messages in unanchored mode.
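For example, the first option is a one-line configuration change (a sketch):

import org.apache.storm.Config;

public class NoAckConfig {
    public static Config build() {
        Config conf = new Config();
        // Setting the acker count to 0 disables ack tracking entirely,
        // trading reliability (at-most-once semantics) for throughput.
        conf.setNumAckers(0);
        return conf;
    }
}

On the Spout side, calling collector.emit(tuple) without a message ID leaves the tuple untracked; on the Bolt side, calling collector.emit(values) without the input tuple as an anchor (unanchored mode) cuts the tuple tree at that point.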

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 300
Streaming and Other Components

[Figure: Kafka topics (Topic1 to Topic N) feed Streaming topologies (Topology1 to Topology N), which write results to HDFS, Redis, HBase, Kafka, and other systems.]

External components such as HDFS and HBase are integrated to facilitate both real-time and offline analysis.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 301
CONTENTS
01 02 03 04
Introduction to System Key Features Introduction to
Streaming Architecture StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 302
StreamCQL Overview

StreamCQL (Stream Continuous Query Language) is a query language used on distributed stream processing platforms and can be built on top of various stream processing engines (mainly Apache Storm).

Currently, most stream processing platforms provide only distributed processing capabilities but involve complex
service logic development and poor stream computing capabilities. The development efficiency is low due to low reuse
and repeated development. StreamCQL provides various distributed stream computing functions, including traditional
SQL functions such as filtering and conversion, and new functions such as stream-based time window computing,
window data statistics, and stream data splitting and merging.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 303
StreamCQL Easy to Develop
Native Storm API (Java):

//Define input:
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {…}
public void nextTuple() {…}
public void ack(Object id) {…}
public void declareOutputFields(OutputFieldsDeclarer declarer) {…}

//Define logic:
public void execute(Tuple tuple, BasicOutputCollector collector) {…}
public void declareOutputFields(OutputFieldsDeclarer ofd) {…}

//Define output:
public void execute(Tuple tuple, BasicOutputCollector collector) {…}
public void declareOutputFields(OutputFieldsDeclarer ofd) {…}

//Define topology:
public static void main(String[] args) throws Exception {…}

StreamCQL:

--Define input:
CREATE INPUT STREAM S1 …

--Define logic:
INSERT INTO STREAM filterstr SELECT * FROM S1 WHERE name="HUAWEI";

--Define output:
CREATE OUTPUT STREAM S2 …

--Define topology:
SUBMIT APPLICATION test;


Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 304
StreamCQL and Stream Processing Platform

• Service interface: CQL and IDE.
• Functions: Join, Aggregate, Split, Merge, Pattern Matching, Stream, and Window.
• Engine: Storm and other stream processing engines.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 305
Summary
This module describes the following information about Streaming:
• Definition
• Application Scenarios
• Position of Streaming in FusionInsight
• System architecture of Streaming
• Key features of Streaming
• Introduction to StreamCQL

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 306
Quiz


How is message reliability guaranteed in
Streaming?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 307
Quiz

• Which of the following statements about Streaming fault tolerance are CORRECT?

A. Nimbus HA supports hot failover and eliminates single points of failure.


B. Supervisor faults can be automatically recovered without affecting running services.
C. Worker faults can be automatically recovered.
D. Tasks on a faulty node of the cluster will be re-assigned to other normal nodes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 308
Quiz

• Which of the following statements about Supervisor is CORRECT?

A. Supervisor assigns resources and schedules tasks.


B. Supervisor receives tasks assigned by Nimbus, and starts / stops Worker processes.
C. Supervisor runs component logic processes.
D. Supervisor receives and processes data in a topology.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 309
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 310
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical principles
of Flink A
Objectives
After completing this course, you will be able to
Key features of Flink
B
understand:
Flink integration in FusionInsight HD
C
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 313
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 314
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 315
Flink Overview

• Flink is a unified computing framework that supports both batch


processing and stream processing. It provides a streaming data
processing engine that supports data distribution and parallel
computing. Flink features stream processing, and is a top open-source
stream processing engine in the industry.

• Flink, similar to Storm, is an event-driven real-time streaming system.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 316
Key Features of Flink

Flink

Streaming-first Fault-tolerant Scalable Excellent performance

Stream Reliability and Scaling out to High throughput


processing engine checkpoint mechanism over 1000 nodes and low latency

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 317
Key Features of Flink

Low Latency Exactly Once HA Scale-out

Millisecond-level Asynchronous snapshot Active / standby Manual scale-out


processing capability. mechanism, ensuring JobManagers, preventing supported by
that all data is processed single points of failure TaskManagers.
only once. (SPOFs).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 318
Application Scenarios of Flink

Flink provides high-concurrency data processing,


millisecond-level latency, and high reliability, making it
extremely suitable for low-latency data processing
scenarios.

Typical scenarios:
• Internet finance services.
• Clickstream log processing.
• Public opinion monitoring.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 319
Hadoop Compatibility

[Figure: a dataflow graph mixing Hadoop map and reduce operators with Flink join operations.]

Flink supports YARN and can obtain data from the Hadoop distributed file system (HDFS) and HBase.

Flink supports all formatted input and output of Hadoop.

Flink supports the Mappers and Reducers of Hadoop, which can be used together with Flink operations.

Flink can run Hadoop jobs faster.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 320
Performance Comparison of Stream Computing
Frameworks
Storm & Flink Identity Single-Thread Throughput
[Bar chart, throughput in pieces/s: with a 1-partition source, 76,519.48 versus 87,729.76; with an 8-partition source, 277,792.60 versus 350,466.22, with Flink delivering the higher throughput in each group.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 321
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture FusionInsight HD
of Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 322
Flink Technology Stack

• APIs & Libraries: FlinkML (machine learning), Gelly (graph processing), CEP (complex event processing), and Table (relational), built on the DataStream API (stream processing) and the DataSet API (batch processing).
• Core: the runtime, a distributed streaming dataflow engine.
• Deploy: Local (single JVM), Cluster (Standalone, YARN), and Cloud (GCE, EC2).

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 323
Core Concept of Flink - DataStream

DataStream: Flink uses the DataStream class to represent streaming data in applications. A DataStream can be considered an immutable collection that may contain duplicate elements; the number of elements in a DataStream is unlimited.

[Figure: stream-type transformations. connect(DataStream) yields ConnectedStreams; join(DataStream) and coGroup(DataStream) yield JoinedStreams and CoGroupedStreams, consumed through window(…).apply(…); keyBy() yields a KeyedStream, on which window() yields a WindowedStream, while windowAll() yields an AllWindowedStream; operators such as map(), flatMap(), reduce(), fold(), sum(), and max() return DataStreams.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 324
DataStream

Data source: indicates the streaming data source, which can be HDFS files, Kafka
data, or texts.

Transformations: indicates streaming data conversion.

Data sink: indicates data output, which can be HDFS files, Kafka data, or texts.

Data Source Transformations Data Sink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 325
Data Source of Flink

Batch processing:
• Files: HDFS, local file system, MapR file system; Text, CSV, Avro, and Hadoop input formats
• JDBC
• HBase
• Collections

Stream processing:
• Files
• Socket streams
• Kafka
• RabbitMQ
• Flume
• Collections
• Implement your own: SourceFunction.collect
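A sketch of creating streams from a few of these sources with the Java DataStream API; the file path, host, and port are placeholders.

import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceExamples {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // File source: reads a text file line by line (the HDFS path is a placeholder).
        DataStream<String> fromFile = env.readTextFile("hdfs:///tmp/input.txt");

        // Socket stream: reads newline-delimited text from a TCP socket.
        DataStream<String> fromSocket = env.socketTextStream("localhost", 9999);

        // Collection source: convenient for testing.
        DataStream<String> fromCollection = env.fromCollection(Arrays.asList("a", "b", "c"));

        fromFile.print();
        env.execute("source-examples");
    }
}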

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 326
DataStream Transformations

Common
transformations

public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper)


public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper)
public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter)
public KeyedStream<T, Tuple> keyBy(int... fields)
public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, int field)
public DataStream<T> rebalance()
public DataStream<T> shuffle()
public DataStream<T> broadcast()
public <R extends Tuple> SingleOutputStreamOperator<R> project(int... fieldIndexes)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 327
DataStream Transformations

[Figure: an example pipeline. (1) textFile reads from HDFS, (2) map, (3) flatMap, (4) keyBy, (5) window / join, and (6) writeAsText writes results back to HDFS.]
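A sketch of the same six-stage pipeline in the Java DataStream API; the HDFS paths are placeholders.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class PipelineExample {

    // Splits each line into (word, 1) pairs.
    public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);   // each operator runs as two parallel subtasks

        DataStream<Tuple2<String, Integer>> counts = env
            .readTextFile("hdfs:///tmp/in")          // (1) textFile
            .map(String::toLowerCase)                // (2) map
            .flatMap(new Tokenizer())                // (3) flatMap
            .keyBy(0)                                // (4) keyBy
            .timeWindow(Time.seconds(10))            // (5) window
            .sum(1);

        counts.writeAsText("hdfs:///tmp/out");       // (6) writeAsText
        env.execute("pipeline-example");
    }
}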

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 328
Flink Application Running Process - Key Roles

• Indicates the request initiator, which submits application requests and creates the
Client data flow.

• Manages the resources for applications. JobManager applies to ResourceManager


JobManager for resources based on the requirements of applications.

ResourceManager • Indicates the resource management department, which schedules and allocates the
of YARN resources of the entire cluster in a unified manner.

• Performs computing work. An application will be split and assigned to multiple


TaskManager TaskManagers for computing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 329
Flink Job Running Process
[Figure: the Flink program's code is translated by the optimizer / graph builder in the client into a dataflow graph, which is submitted to the JobManager (master / YARN application master). The JobManager's actor system, scheduler, and checkpoint coordinator deploy, stop, and cancel tasks, trigger checkpoints, and send status updates, statistics, and results back to the client. TaskManagers (workers) run tasks in task slots, each with its own memory & I/O manager and network manager, exchange data streams with one another, and report task status and heartbeats to the JobManager.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 330
Flink on YARN

1. The client stores the uber JAR and configuration in HDFS.
2. The YARN client registers resources with the YARN ResourceManager and requests an ApplicationMaster container.
3. The ResourceManager allocates the ApplicationMaster container, which bootstraps the Flink JobManager together with the YARN ApplicationMaster.
4. The ApplicationMaster allocates worker containers; each container is bootstrapped with the uber JAR and configuration and runs a Flink TaskManager.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 331
Technical Principles of Flink (1)

• A Flink application consists of streaming data and transformation operators.


• Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operator
that takes one or more streams as input, and produces one or more output streams as a result.

Stream
Transformation

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 332
Technical Principles of Flink (2)

The source operator loads streaming data. Transformation operators, such as map(), keyBy(), and apply(), process the streaming data. After the streaming data is processed, the sink operator writes it to storage systems such as HDFS, HBase, and Kafka.

[Figure: streaming dataflow. Source operator, then transformation operators (map(), keyBy(), apply()), then Sink operator.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 333
Parallel DataStream of Flink
[Figure: in the condensed view, the dataflow is Source, map(), keyBy()/apply(), Sink. In the parallelized view, the source, map(), and keyBy()/apply() operators each run with parallelism 2, as two parallel operator subtasks connected by stream partitions, while the sink runs with parallelism 1.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 334
Operator Chain of Flink
[Figure: in the condensed view, the dataflow is Source, map(), keyBy()/apply(), Sink. In the parallelized view, the source and map() operators are chained into one task, so each subtask (thread) of that task executes both operators; keyBy()/apply() and the sink form separate tasks. With parallelism 2, the chained source+map task and the keyBy()/apply() task each run two subtasks, and the sink runs one subtask.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 335
Windows of Flink

Flink supports operations based on time windows and operations based on data (count) windows.
• Categorized by splitting standard: time windows and count windows.
• Categorized by window action: tumbling windows, sliding windows, session windows, and custom windows.

[Figure: an event stream split into time windows and count(3) windows.]
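For instance, with the Java API a keyed stream can be windowed by time or by element count; a sketch, assuming a keyed stream of (word, count) pairs as in the earlier pipeline example:

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowExamples {
    public static void demo(KeyedStream<Tuple2<String, Integer>, Tuple> keyed) {
        // Tumbling time window: non-overlapping 10-second windows.
        keyed.timeWindow(Time.seconds(10)).sum(1);
        // Sliding time window: 1-minute windows evaluated every 10 seconds.
        keyed.timeWindow(Time.minutes(1), Time.seconds(10)).sum(1);
        // Count window: split by element count, size 100, slide 10.
        keyed.countWindow(100, 10).sum(1);
    }
}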

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 336
Common Window Types of Flink (1)

Tumbling windows, whose time spans do not overlap.

[Figure: per-user event streams are cut into consecutive windows 1 to 5 of a fixed window size along the time axis.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 337
Common Window Types of Flink (2)

Sliding windows, whose time spans overlap.

[Figure: windows 1 to 4 of a fixed window size overlap as they slide along the time axis.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 338
Common Window Types of Flink (3)

Session windows, which are considered complete if no data arrives within the preset gap.

[Figure: per-user event streams form windows separated by session gaps along the time axis.]
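A sketch of a session window in the Java API; the 10-minute gap and the keyed stream are illustrative assumptions.

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionWindowExample {
    public static void demo(KeyedStream<Tuple2<String, Integer>, Tuple> keyed) {
        // The window closes once no element arrives for the configured gap.
        keyed.window(ProcessingTimeSessionWindows.withGap(Time.minutes(10)))
             .sum(1);
    }
}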

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 339
Fault Tolerance of Flink
The checkpoint mechanism is a key fault tolerance measure of Flink.

The checkpoint mechanism keeps creating status snapshots of stream applications. The status
snapshots of the stream applications are stored at a configurable place (for example, in the memory
of JobManager or on HDFS).

The core of the distributed snapshot mechanism of Flink is the barrier. Barriers are periodically inserted
into data streams and flow as part of the data streams.

[Figure: barriers flow within the DataStream between newer and older tuples; records after checkpoint barrier n are part of checkpoint n+1, records between barriers n and n-1 are part of checkpoint n, and earlier records are part of checkpoint n-1.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 340
Checkpoint Mechanism (1)

• The checkpoint mechanism is the cornerstone of Flink's reliability. When an exception occurs on an operator in the Flink cluster (for example, an unexpected exit), the checkpoint mechanism can restore all application states to a consistent earlier point in time.

• This mechanism ensures that when a running application fails,


all statuses of the application can be restored from a
checkpoint so that data is processed only once. Alternatively,
you can choose to process data at least once.
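Enabling checkpointing is a per-application setting; a sketch (the 10-second interval is an arbitrary example):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a state snapshot of the application every 10 seconds.
        env.enableCheckpointing(10000);
        // EXACTLY_ONCE is the default; AT_LEAST_ONCE relaxes the guarantee
        // in exchange for lower latency during barrier alignment.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        return env;
    }
}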

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 341
Checkpoint Mechanism (2)
[Figure: the CheckpointCoordinator injects a barrier at the source operator; as the barrier passes through the source, intermediate, and sink operators, each operator takes a snapshot of its state and acknowledges it to the CheckpointCoordinator.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 342
Checkpoint Mechanism (3)
[Figure: with two sources A and B, operator C aligns the barriers; it waits until the barriers from both sources have arrived, takes its snapshot, and then emits a merged barrier downstream to operator D.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 343
CONTENTS
01 02 03
Flink Overview Technical Principles Flink Integration in
and Architecture of FusionInsight HD
Flink

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 344
Location of Flink in FusionInsight Products
Application service layer

Open API / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Porter Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive MapReduce Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase

• FusionInsight HD provides a Big Data processing environment and selects the best practice in the industry
based on scenarios and open source software enhancement.
• Flink is a unified computing framework that supports both batch processing and stream processing. Flink
provides high-concurrency pipeline data processing, millisecond-level latency, and high reliability.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 345
Flink Web UI

The FusionInsight HD platform provides a visual management and monitoring UI for Flink. You can use the YARN Web UI to query the running status of Flink tasks.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 346
Interaction of Flink with Other Components

In the FusionInsight HD cluster, Flink interacts


with the following components:

HDFS
• (mandatory) Flink reads and writes data in HDFS.

YARN
• (mandatory) Flink relies on YARN to schedule and manage
resources for running tasks.

ZooKeeper
• (mandatory) Flink relies on ZooKeeper to implement the
checkpoint mechanism.
Kafka
• (optional) Flink can receive data streams sent from Kafka.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 347
Summary
• These slides describe the following information about Flink: basic concepts, application scenarios,
technical architecture, window types, and Flink on YARN.

• These slides also describe Flink integration in FusionInsight HD.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 348
Quiz

• What are the key features of Flink?


• What are the common window types of Flink?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 349
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 350
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Loader

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


What Loader is A
What Loader can be used for B
Position of Loader in FusionInsight C
System architecture of Loader D
Objectives
Upon completion of this course, you will be able Main features of Loader E
to know:
How to manage Loader jobs F
How to monitor Loader jobs G
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 353
CONTENTS
01 02
Introduction to Loader Loader Job Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 354
01 02
Introduction to Loader Loader Job Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 355
What Is Loader

• Loader is a loading tool for data and file exchange between FusionInsight HD and relational databases or file systems. Loader provides a wizard-based job configuration and management Web UI and supports scheduled tasks for running Loader jobs periodically. On the Web UI, users can specify multiple data sources, configure data cleaning and conversion steps, and set the cluster storage system.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 356
Application Scenarios of Loader

RDB

Hadoop
SFTP Server
Loader • HDFS

FTP Server
• HBase
• Hive

Customized
Data Source

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 357
Position of Loader in FusionInsight

Application service layer


Open API / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Loader Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase

Loader is a loading tool for data and file exchange between FusionInsight HD and
relational databases and file systems.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 358
Features of Loader

Loader features:
• GUI: provides a graphical user interface that facilitates operations.
• Security: Kerberos authentication.
• High reliability: Loader Servers are deployed in active / standby mode; jobs are executed through MapReduce and support retry after failure; no junk data is left after a job failure.
• High performance: uses MapReduce for parallel data processing.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 359
Module Architecture of Loader

Loader External Data Source


Loader Client
Tool Web UI JDBC File

REST API JDBC SFTP / FTP

Transform Engine
Job
Execution Engine
Scheduler
Submission Engine Yarn Map Task

Job Manager
HBase
Metadata Repository
HDFS Reduce Task
HA Manager
Hive
Loader Server

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 360
Module Architecture of Loader - Module Description
Module Description
Loader Client Provides a web user interface (Web UI) and a command-line interface (CLI).

Processes operation requests sent from the client, manages connectors and
Loader Server
metadata, submits MapReduce jobs, and monitors MapReduce job status.
REST API Provides a RESTful (Representational State Transfer) interface (HTTP + JSON) to process operation requests from the client.
Job Scheduler Periodically executes Loader jobs.
A data transformation engine that supports field combination, string cutting, and string
Transform Engine
reverse.
Execution Engine Executes Loader jobs in MapReduce manner.
Submission Engine Submits Loader jobs to MapReduce.
Manages Loader jobs, including creating, querying, updating, deleting, activating /
Job Manager
deactivating, starting and stopping jobs.
Metadata warehouse, which stores and manages connectors, conversion steps, and
Metadata Repository
Loader jobs.
Manages the standby and active status of Loader Servers. Two Loader Servers are
HA Manager
deployed in active / standby mode.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 361
01 02
Introduction to Loader Loader Job Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 362
Service Status Web UI of Loader
• Choose Services > Loader to go to the Loader Status page.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 363
Job Management Web UI of Loader

• On the Loader Status page, click Loader Server (Active)


to go to the job management Web UI of Loader.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 364
Job Management Web UI of Loader - Job

• A job describes the process of extracting,


transforming, and loading data from the
data source to the target end. It includes
data source location and attributes, rules for
source-to-target data conversion, and target
end attributes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 365
Job Management Web UI of Loader - Job
Conversion Rules
Loader Conversion Operators:

• Long Date Conversion: performs long integer and date conversion.

• If Null: converts null values into specified values.

• Add Constants: generates constant fields.

• Generate Random: generates random value fields.

• Concatenate Fields: concatenates existing fields to generate new fields.

• Extracts Fields: separates an existing field by using specified delimiters to generate new fields.

• Modulo Integer: performs modulo operations on existing fields to generate new fields.

• String Cut: cuts existing string fields by the specified start position and end position to generate new fields.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 366
Creating a Loader Job - Basic Information

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 367
Creating a Loader Job - From

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 368
Creating a Loader Job - Transform

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 369
Creating a Loader Job - To

* Storage type HDFS

* File type TEXT_FILE

Compression format Choose…

* Output directory /user/test

File operate type OVERRIDE


Extractor Extractor size

* Number 2

Back Save Save and run Cancel

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 370
Monitoring Job Execution Status

Check the execution status of all jobs:

• Go to the Loader job management page.

• The page displays all current jobs and last execution status.

• Select a job, and click a button in the Operation column to perform a


corresponding operation.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 371
Monitoring Job Execution Status - Job Execution
History

View execution records of specified jobs:


• Select a job, and then click the History button in the Operation column.

• The historical record page displays the start time, duration (s), status, failure cause,
number of read / written / skipped rows / files, dirty data link, and MapReduce log
link of each execution.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 372
Monitoring Job Execution Status - Dirty Data
Dirty data refers to data that does not meet the Loader conversion rules; it can be checked as follows.

• If the number of skipped job records is not 0 on the job history page, click the dirty data button to go to the
dirty data directory.
• Dirty data is stored in HDFS, and the dirty data generated by each Map job is stored in a separate file.

Permission Owner Group Size Last Modified

drwx------ admin hadoop 0B Thu Apr 07 14:13:03 2016

drwx------ admin hadoop 0B Thu Apr 07 14:13:14 2016

drwx------ admin hadoop 0B Thu Apr 07 14:13:15 2016

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 373
Monitoring Job Execution Status - MapReduce Log

• On the job history page, click the log button.


The MapReduce log page for the execution
is displayed.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 374
Monitoring Job Execution Status - Job Execution
Failure Alarm

When a job fails to be executed, an alarm


is reported.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 375
Summary
This module describes the following information about Loader: main functions and features,
job management and monitoring.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 376
Quiz

• True or False:
A. FusionInsight Loader supports only data import and export between relational databases and
Hadoop HDFS and HBase.
B. Conversion steps must be configured for Loader jobs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 377
Quiz

• Which of the following statements are CORRECT?


A. No residual original files are left when a job fails after proper running for some time.

B. Dirty data refers to the data that does not comply with conversion rules.

C. Loader client scripts can only be used to submit jobs.


D. A human-machine account can be used to perform operations on all Loader jobs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 378
Quiz

• Which of the following statements is CORRECT?


A. If Loader is faulty after it submits a job to MapReduce, the job will fail to be executed.

B. If a Mapper execution fails after Loader submits a job to MapReduce, a second execution is automatically performed.

C. Residual data generated after a Loader job fails to be executed needs to be manually cleared.

D. After Loader submits a job to MapReduce for execution, it cannot submit other jobs before the execution is complete.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 379
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 380
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Flume

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Foreword

• Flume is an open-source log system: a distributed, reliable, and highly available massive log aggregation system. Flume supports customization of data senders and receivers for collecting, processing, and transferring data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 383
What Flume is A
Functions of Flume B
Position of Flume
in FusionInsight C
Objectives System architecture
of Flume D
Upon completion of this course, you will be able
to know: Key characteristics
of Flume E
Application Examples
of Flume F

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 384
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 385
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 386
What is Flume

Flume is a streaming log collection tool. It can perform simple processing on data and write the data to customizable data receivers. Flume can collect data from various data sources such as local files (spooling directory source), real-time logs (taildir and exec sources), REST messages, Thrift, Avro, syslog, and Kafka.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 387
Functions of Flume

Flume can collect logs from a specified directory and save the logs in a
01 specified path (HDFS, HBase, and Kafka).

02 Flume can collect and save logs (taildir) to a specified path in real time.

Flume supports the cascading mode (multiple Flume nodes interwork with
03 each other) and data aggregation.

04 Flume supports customized data collection.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 388
Position of Flume in FusionInsight
Application service layer

Open API / SDK REST / SNMP / Syslog

Data Information Knowledge Wisdom


DataFarm Flume Miner Farmer Manager

System
Hadoop API Plugin API management

Service
Hive M/R Spark Storm Flink
governance
Hadoop YARN / ZooKeeper LibrA Security
management
HDFS / HBase

Flume is a distributed framework for collecting and aggregating


stream data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 389
Architecture of Flume (1)
• Basic Flume architecture: Flume can directly collect data on a single node. This architecture is mainly applicable to
data collection within a cluster.

Log Source Channel Sink HDFS

• Multi-agent architecture of Flume: multiple Flume agents can be cascaded. After collecting data from the original data sources, Flume saves the data in the final storage system. This architecture is mainly applicable to importing data from outside the cluster into the cluster.

Source Source

Sink Sink
Log HDFS
Channel Channel

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 390
Architecture of Flume (2)

(Flume agent internals: events from a Source pass through an Interceptor to the Channel Processor, whose Channel Selector decides which Channel or Channels receive the events; on the other side, a Sink Runner drives the Sink Processor, which takes events from a Channel and hands them to a Sink.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 391
Basic Concept - Source (1)

A source receives or generates events based on special mechanisms and saves the events to one or more channels in batches. Sources are classified into event-driven sources and event-polling sources.

• Event-driven source: The external source actively sends data to Flume, driving Flume to receive the data.
• Event-polling source: Flume actively polls the data source for data at regular intervals.

The source must be associated with at least one channel.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 392
Basic Concept - Source (2)

Source Type / Description:

• exec source: Runs a command or script and uses the execution output as a data source.
• avro source: Provides an Avro-based server bound to a port, waiting for data sent from Avro clients.
• thrift source: The same as the avro source, but the transmission protocol is Thrift.
• http source: Supports data transmission based on HTTP POST.
• syslog source: Collects syslog logs.
• spooling directory source: Collects local static files.
• jms source: Obtains data from a message queue.
• kafka source: Obtains data from Kafka.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 393
Basic Concept - Channel (1)

The channel is located between the source and the sink and functions similarly to a queue: it temporarily stores events. When the sink successfully sends events to the channel of the next hop or to the destination, the events are removed from the current channel.

The persistence levels vary with channels.


• Memory channel: The persistence is not supported.
• File channel: The persistence is achieved based on the Write-Ahead Log (WAL).
• JDBC channel: The persistence is achieved based on the embedded database.

Channels support transactions and provide weak ordering guarantees. A channel can connect to any number of sources and sinks.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 394
Basic Concept - Channel (2)

Memory channel
• Messages are stored in memory. This channel provides high throughput but no reliability; data may be lost.

File channel
• It supports data persistence, but the configuration is complex: both a data directory and a checkpoint directory must be configured, and a separate checkpoint directory is needed for each file channel.

JDBC channel
• It uses the embedded Derby database. It supports event persistence with high reliability and can replace the file channel when persistence is required.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 395
Basic Concept - Sink (1)

• The sink transmits events to the next hop or


destination. After the events are successfully
transmitted, they are removed from the current
channel.
• The sink must be bound to a specific channel.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 396
Basic Concept - Sink (2)

Sink Type / Description:

• hdfs sink: Writes data to HDFS.
• avro sink: Transmits data to the next-hop Flume node using the Avro protocol.
• thrift sink: The same as the avro sink, but the transmission protocol is Thrift.
• file roll sink: Saves data to the local file system.
• hbase sink: Writes data to HBase.
• kafka sink: Writes data to Kafka.
• MorphlineSolr sink: Writes data to Solr.
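As a sketch, the three concepts are wired together in an agent configuration file as follows; the component names (a1, ch1, s1), the command, and the HDFS path are illustrative only:

server.sources = a1
server.channels = ch1
server.sinks = s1
# exec source: use the output of a command as the data source.
server.sources.a1.type = exec
server.sources.a1.command = tail -F /var/log/test.log
server.sources.a1.channels = ch1
# memory channel: high throughput, no persistence.
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
# hdfs sink: write events to HDFS; a sink binds to exactly one channel.
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_test
server.sinks.s1.channel = ch1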

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 397
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 398
Log Collection

• Flume can collect logs from outside a cluster and archive them in HDFS, HBase, or Kafka for data analysis and cleaning by upper-layer applications.

Log Source Channel Sink HDFS

Log Source Channel Sink HBase

Log Source Channel Sink Kafka

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 399
Multi - level Cascading and Multi - channel Duplication

• Multiple Flume nodes can be cascaded. The cascaded nodes support internal
data duplication.

Source
Channel
Log
Sink

Channel Sink HDFS


Source
Channel Sink HBase

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 400
Message Compression and Encryption by
Cascaded Flume Nodes
• Data transmitted between cascaded Flume nodes can be compressed and encrypted,
thereby improving the data transmission efficiency and security.

(Diagram: between cascaded Flume nodes, data sent over RPC is compressed and encrypted by the sending node and decompressed and decrypted by the receiving node before being written through the Flume API to HDFS / Hive / HBase / Kafka.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 401
Data Monitoring

FusionInsight
Flume monitoring information
Manager

Flume
Application
Received data size Transmitted data size

Source Data buffer size Sink HDFS / Hive /


Flume API HBase / Kafka
Channel

Transmitted data size

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 402
Transmission Reliability

• Flume adopts a transactional model for data transmission, which ensures data security and enhances reliability during transmission. In addition, if the file channel is used, data buffered in the channel is not lost when a process or node restarts.

Channel Sink Source Channel

Start tx
Take events
Send events
Start tx

Put events

End tx End tx

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 403
Transmission Reliability (Failover)

• During data transmission, if the next-hop Flume node is faulty or receives data abnormally, the
data is automatically switched over to another path.

Source Sink

HDFS
Source
Channel
Sink
Log
Channel
Source Sink
Sink
HDFS
Channel

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 404
Data Filtering During Transmission

• During data transmission, Flume can roughly filter and clean data, discarding what is unnecessary. If more complex filtering is required, you can develop filter plug-ins tailored to the data; Flume also supports third-party filter plug-ins. (A configuration sketch follows the diagram below.)

Interceptor

Channel

events events
Channel Channel
Source
Processor Selector events

Channel
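For example, the built-in regex filtering interceptor of open-source Flume can be attached to a source to drop matching events; a minimal sketch (agent and component names are illustrative):

# Drop events whose body matches the regex (here: empty lines).
server.sources.a1.interceptors = i1
server.sources.a1.interceptors.i1.type = regex_filter
server.sources.a1.interceptors.i1.regex = ^$
server.sources.a1.interceptors.i1.excludeEvents = true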

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 405
CONTENTS
01 02 03
Flume Overview and Key Characteristics of Flume Flume Applications
Architecture

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 406
Flume Example 1 (1)

Description
• In this application scenario, Flume collects logs from an application (for example, the online banking system) outside the cluster and saves the logs to HDFS.

Data preparations
• Create a log directory /tmp/log_test on a node in the cluster.
• Use this directory as the monitoring directory.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 407
Flume Example 1 (2)

Download the Flume Client
• Log in to the FusionInsight HD cluster. Choose Service Management > Flume > Download Client.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 408
Flume Example 1 (3)

• Install Flume client:


Decompress the client
tar -xvf FusionInsight_V100R002C60_Flume_Client.tar
tar -xvf FusionInsight_V100R002C60_Flume_ClientConfig.tar
cd FusionInsight_V100R002C60_Flume_ClientConfig/Flume
tar -xvf FusionInsight-Flume-1.6.0.tar.gz

Install the client


./install.sh -d /opt/FlumeClient -f hostIP -c
flume/conf/client.properties.properties

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 409
Flume Example 1 (4)
• Configure flume source

server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 410
Flume Example 1 (5)

• Configure flume channel

# the channel configuration of ch1


server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 411
Flume Example 1 (6)
• Configure flume sink

server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 412
Flume Example 1 (7)

• Name the configuration file of the Flume agent properties.properties.


• Upload the configuration file.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 413
Flume Example 1 (8)

01 Move data files to the monitoring directory /tmp/log_test:

mv /var/log/log.11 /tmp/log_test

02 Check whether the data has been written to HDFS:

hdfs dfs -ls /tmp/flume_avro

03 log.11 has been renamed log.11.COMPLETED, which indicates that the data collection succeeded.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 414
Flume Example 2 (1)

Description
• In this application scenario, Flume collects real-time clickstream logs and saves them to Kafka for real-time analysis.

Data preparations
• Create a log directory /tmp/log_click on a node in the cluster.
• Collect the data into the Kafka topic topic_1028.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 415
Flume Example 2 (2)
• Configure flume source:
server.sources = a1
server.channels = ch1
server.sinks = s1

# the source configuration of a1


server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_click
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.selector.type = replicating
server.sources.a1.basenameHeaderKey = basename
server.sources.a1.deserializer.maxBatchLine = 1
server.sources.a1.deserializer.maxLineLength = 2048
server.sources.a1.channels = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 416
Flume Example 2 (3)

• Configure flume channel:

# the channel configuration of ch1


server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
server.channels.ch1.channlefullcount = 10
server.channels.ch1.keep-alive = 3
server.channels.ch1.byteCapacityBufferPercentage = 20

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 417
Flume Example 2 (4)

• Configure flume sink:

# the sink configuration of s1


server.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
server.sinks.s1.kafka.topic = topic_1028
server.sinks.s1.flumeBatchSize = 1000
server.sinks.s1.kafka.producer.type = sync
server.sinks.s1.kafka.bootstrap.servers = 192.168.225.15:21007
server.sinks.s1.kafka.security.protocol = SASL_PLAINTEXT
server.sinks.s1.requiredAcks = 0
server.sinks.s1.channel = ch1

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 418
Flume Example 2 (5)
• Upload the configuration file to Flume.
• Use Kafka commands to view the data collected into topic_1028, for example with the console consumer shown below.
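A sketch of such a check with the console consumer shipped with the Kafka client; the broker address matches the sink configuration above, while the consumer configuration file path is an assumption:

kafka-console-consumer.sh --topic topic_1028 \
  --bootstrap-server 192.168.225.15:21007 \
  --consumer.config config/consumer.properties --from-beginning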

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 419
Summary
This course describes Flume functions and application scenarios, including the basic concepts, functions,
reliability, and configuration items. Upon completion of this course, you can understand Flume functions,
application scenarios, and configuration methods.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 420
Quiz

• What is Flume? What are the functions of Flume?
• What are the key characteristics of Flume?
• What are the functions of the source, channel, and sink?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 421
Quiz

True or False
• Flume supports cascading. That is, multiple Flume nodes can be cascaded for
data transmission.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 422
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 423
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Technical Principles
of Kafka

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. Basic concepts and application scenarios of Kafka
B. System architecture of Kafka
C. Key processes of Kafka
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 426
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 427
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 428
Kafka Overview

• Definition of Kafka: Kafka is a high-throughput, distributed, publish-subscribe messaging system. A large messaging system can be built on low-cost servers with Kafka technology.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 429
Kafka Overview
Application scenarios
• Compared with other components, Kafka features message persistence, high throughput, distributed processing, and real-time processing. It applies to online and offline message consumption and massive data collection scenarios, such as website activity tracking, operation-data monitoring with aggregation statistics, and log collection.

Frontend Backend
Producer Producer

Flume Storm

Kafka
Hadoop Spark

Farmer

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 431
Position of Kafka in FusionInsight

(FusionInsight architecture diagram: Kafka is a Hadoop-layer service, alongside M/R, Hive, Spark, Streaming, and Solr, above YARN/ZooKeeper, LibrA, and HDFS/HBase, and below the DataFarm layer of Porter, Miner, Farmer, and Manager.)

Kafka is a distributed messaging system that supports online and offline message
processing and provides Java APIs for other components.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 432
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
 Kafka Write Process
 Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 433
Kafka Topology

(Producer) Front End Front End Front End Service

(Push) (Push) (Push) (Push)

ZooKeeper
Zoo Keeper
(Kafka) Broker Broker Broker Zoo Keeper

(Pull) (Pull) (Pull) (Pull)

Hadoop Real-time Other Data


(Consumer) Cluster Monitoring Service Warehouse

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 434
Kafka Topics
Consumer group 1
Consumer group 2 A consumer uses offsets to record and
read location information.
Kafka cleans up old messages
based on the time and size.

Kafka topic

… new

Older msgs Newer msgs Producer 1


Producer 2

Producer N
Producer appends messages
at the end of a topic.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 435
Kafka Partition
• Each topic contains one or more partitions. Each partition is an ordered and immutable
sequence of messages. Partitions ensure high throughput capabilities of Kafka.

Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12

Partition 1 0 1 2 3 4 5 6 7 8 9 Writes

Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12

Old New

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 436
Kafka Partition
• Consumer group A has two consumers reading data from four partitions.
• Consumer group B has four consumers reading data from four partitions.

Kafka Cluster

Server 1 Server 2

P0 P3 P1 P2

C1 C2 C3 C4 C5 C6

Consumer group A Consumer group B

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 437
Kafka Partition Offset
• The location of a message in a log file is called offset, which is a long integer that uniquely
identifies a message. Consumers use offsets, partitions, and topics to track records.

Consumer
group C1

Partition 0 0 1 2 3 4 5 6 7 8 9 10 11 12

Partition 1 0 1 2 3 4 5 6 7 8 9 Writes

Partition 2 0 1 2 3 4 5 6 7 8 9 10 11 12

Old New

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 438
Kafka Partition Replica (1)

Kafka Cluster

Broker 1 Broker 2 Broker 3 Broker 4

Partition-0 Partition-1 Partition-2 Partition-3

Partition-3 Partition-0 Partition-1 Partition-2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 439
Kafka Partition Replica (2)

Follower->Leader
Pulls data

ReplicaFetcherThread

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7

writes
old new old new

Leader Partition Follower Partition


ack

Producer

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 440
Kafka Partition Replica (3)
ReplicaFetcherThread

Broker 1 Broker 2 Broker 3


Leader Follower Follower
Partition-0 Partition-0 Partition-0
Leader Follower Follower
Partition-1 Partition-1 Partition-1

… … …

ReplicaFetcherThread ReplicaFetcherThread-1

Broker 1 Broker 2 Broker 3


Leader Leader Follower
Partition-0 Partition-1 Partition-0

… … Follower
Partition-1

ReplicaFetcherThread-2

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 441
Kafka Logs (1)
• A large file in a partition is split into multiple small segments. These segments facilitate
periodical clearing or deletion of consumed files to reduce disk usage.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 442
Kafka Logs (2)
segment file 1
msg-00000000000
in-memory index
msg-00000000215
delete msg-00000000000

……
msg-00014517018
msg-00030706778 msg-00014516809

reads

……
append msg-02050706778

segment file N
msg-02050706778
msg-02050706945

……
msg-02614516809

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 443
Kafka Logs (3)

00000000000000368769.log
Message368770 0
00000000000000368769.index Message368771 139
1,0 Message368772 497
3,497 Message368773 830
6,1407 Message368774 1262
8,1686 Message368775 1407
… Message368776 1508
Message368777 1686
N,position

Message368769+N position

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 444
Kafka Log Cleanup (1)
• Log cleanup modes: delete and compact.
• Threshold for deleting logs: retention time limit and size of all logs in a partition.

Parameter / Default Value / Description / Range:

• log.cleanup.policy: default delete. Policy for cleaning up outdated logs. Range: delete or compact.
• log.retention.hours: default 168. Maximum retention time of log files, in hours. Range: 1 ~ 2147483647.
• log.retention.bytes: default -1. Maximum size of log data in a partition; by default the size is not restricted. Unit: byte. Range: -1 ~ 9223372036854775807.
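For example, a broker's server.properties can combine these parameters as follows (the values are illustrative): delete segments older than 7 days, or once a partition's log exceeds about 1 GB, whichever comes first.

log.cleanup.policy=delete
log.retention.hours=168
log.retention.bytes=1073741824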

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 445
Kafka Log Cleanup (2)

Offset 0 1 2 3 4 5 6 7 8 9 10
Log
Key K1 K2 K1 K1 K3 K2 K4 K5 K5 K2 K6 Before
Compaction
Value V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11

Compaction

3 4 6 8 9 10

Keys K1 K3 K4 K5 K2 K6 Log
After
Values V4 V5 V7 V9 V10 V11 Compaction

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 446
Kafka Data Reliability

• All Kafka messages are stored in hard disks and topic


partition replication is performed to ensure data reliability.
• How is data reliability ensured during message delivery?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 447
Message Delivery Semantics

There are three data delivery modes:

At Most Once
• Messages may be lost.
• Messages are never redelivered or reprocessed.

At Least Once
• Messages are never lost.
• Messages may be redelivered and reprocessed.

Exactly Once
• Messages are never lost.
• Messages are processed only once.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 448
Kafka Message Delivery

• Messages are delivered in different modes to ensure reliability in different


application scenarios.

Delivery modes versus replication modes:

• No replicas: synchronous delivery without confirmation: at most once; synchronous delivery with confirmation: at least once; asynchronous delivery without confirmation: at most once; asynchronous delivery with confirmation but no retries: at least once; asynchronous delivery with confirmation and retries: at least once.

• Synchronous replication (leader and followers): synchronous delivery without confirmation: at most once; synchronous delivery with confirmation: at least once; asynchronous delivery without confirmation: at most once; asynchronous delivery with confirmation but no retries: at least once; asynchronous delivery with confirmation and retries: at least once.

• Asynchronous replication (leader only): delivery without confirmation (synchronous or asynchronous): at most once; delivery with confirmation: messages may be lost or repeated.
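In the open-source Java producer, these delivery guarantees map to client settings; the following is a hedged sketch (the property names are standard Kafka producer configurations, while the values are illustrative):

# acks=0: no broker confirmation (at most once).
# acks=1: confirmation from the leader only.
# acks=all: confirmation from the leader and in-sync followers;
# combined with retries > 0 this gives at-least-once delivery.
acks=all
retries=3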

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 449
Kafka Cluster Mirroring

ZooKeeper
ZooKeeper

Kafka Broker ZooKeeper

Producers Kafka Broker


consumer

Source Cluster Data


producer
Data
Mirror Maker

Target Cluster

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 450
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 451
Write Data by Producer

Data Create
Data
Message

Publish
Message
Producer

Message

Kafka Cluster
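A sketch of publishing test messages with the console producer shipped with the Kafka client; the topic name, broker address, and configuration file path are illustrative:

kafka-console-producer.sh --topic test_topic \
  --broker-list 192.168.225.15:21007 \
  --producer.config config/producer.properties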

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 452
CONTENTS
01 02 03
Introduction to Architecture and Key Processes of
Kafka Functions of Kafka Kafka
• Kafka Write Process
• Kafka Read Process

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 453
Read Data by Consumer

Overall process:
• A consumer connects to the broker where the leader of the specified topic partition is located, subscribes, and pulls messages from the Kafka logs.

(Diagram: the consumer subscribes to messages from the Kafka cluster and processes the data.)
Kafka Cluster

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 454
Summary
This module describes the following information about Kafka: basic concepts and application
scenarios, system architecture and key processes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 455
Quiz

• Which of the following are features of Kafka?


A. High throughput.
B. Distributed.
C. Data persistence.
D. Random message read.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 456
Quiz

• What is the component that Kafka directly depends on for running?

A. HDFS.
B. ZooKeeper.
C. HBase.
D. Spark.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 457
Quiz

• How is Kafka data reliability ensured?


• What operations on topics can be performed with the shell commands provided by the Kafka client?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 458
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 459
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


ZooKeeper Cluster Distributed
Coordination Service

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
Upon completion of this course, you will be able to know:
A. Concepts of ZooKeeper
B. System architecture of ZooKeeper
C. Key features of ZooKeeper
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 462
CONTENTS
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 463
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 464
ZooKeeper Overview

ZooKeeper, a distributed service framework,


provides distributed and highly available service
coordination capabilities and is designed to
resolve data management issues
in distributed applications.

In security mode, ZooKeeper depends on Kerberos and LdapServer; in non-security mode, it does not depend on them. As an underlying component, ZooKeeper is used by upper-layer components such as Kafka, HDFS, HBase, and Storm.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 465
ZooKeeper Overview
Hadoop Ecosystem

(Diagram: ZooKeeper provides the coordination service at the center of the ecosystem, which includes core Hadoop (HDFS, MapReduce), the HBase column datastore, the Hive query language, Pig scripting, Mahout machine learning, Drill interactive analysis, Storm stream processing, and data transfer tools such as Sqoop and Flume.)
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 466
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 467
Position of ZooKeeper in FusionInsight
(FusionInsight architecture diagram: ZooKeeper sits in the Hadoop layer together with YARN, beneath services such as Hive, M/R, Spark, Streaming, and Flink and beside HDFS/HBase and LibrA, providing coordination for the whole stack.)

Based on the open source Apache ZooKeeper, ZooKeeper provides services for
upper-layer components and is designed to resolve data management issues in
distributed applications.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 468
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 469
ZooKeeper Service Architecture - Model
ZooKeeper Service
Leader

Server Server Server Server Server

Client Client Client Client Client Client Client Client

• A ZooKeeper cluster is a group of servers. In this group, one server functions as the leader and the other servers are followers.
• ZooKeeper selects a server as the leader upon startup.
• ZooKeeper uses a custom protocol named ZooKeeper Atomic Broadcast (ZAB), which ensures data consistency among the nodes in the system.
• After receiving a data change request, the leader first writes the change to the local disk and then to memory, so that the data can be restored.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 470
ZooKeeper Service Architecture - Disaster
Recovery (DR)

ZooKeeper can select a server as the


leader and provide services correctly.
• An instance that wins more than half of the votes during the election
becomes the leader.

For n instances, n could be odd or even.


• If n = 2x + 1, the node that functions as the leader must win x + 1 votes, and the DR capability is x.
• If n = 2x + 2, the node that functions as the leader must win x + 2 votes, and the DR capability is still x.
• For example, a 5-node cluster (x = 2) and a 6-node cluster both tolerate the failure of only 2 nodes, which is why ZooKeeper is typically deployed on an odd number of nodes.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 471
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 472
Key Features of ZooKeeper

• Eventual consistency: All servers are displayed in the same view.

• Real-time capability: Clients can obtain server updates and failures within a
specified period of time.
• Reliability: A message will be received by all servers.
• Wait-free: Slow or faulty clients cannot interfere with the requests of fast clients, so that each client's requests are processed effectively.

• Atomicity: Data transfer either succeeds or fails, but no transaction is partial.

• Sequence: The sequence of data status updates on clients is consistent with that of
request sending.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 473
Read Process of ZooKeeper

• ZooKeeper consistency indicates that all servers connected to a client are displayed in the same view.
Therefore, read operations can be performed between the client and any server.

ZK 1 (F) ZK 2 (L) ZK 3 (F)

Local Storage Local Storage Local Storage

1.Read Request 2.Read Response

Client

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 474
Write Process of ZooKeeper

3.2.Send Proposal
2.Write Request
4.1
ZK 1 (F) 3.1 ZK 2 (L) ZK 3 (F)
3.3.Send Proposal
4.2.ACK
4.3.ACK
5.1
5.3.Commit
Local Storage 5.2.Commit Local Storage Local Storage

1.Write Request

Client
6.Write Response

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 475
ACL (Access Control List)

The access control list (ACL) controls access to ZooKeeper. An ACL applies to a specified znode and is not inherited by the znode's subnodes. Run the setAcl /znode scheme:id:perm command to set an ACL.

Scheme indicates the authentication mode. ZooKeeper provides four authentication modes:
• world: a single ID (anyone); any client can access ZooKeeper.
• auth: does not use any ID; only authenticated users can access ZooKeeper.
• digest: uses the MD5 hash value generated from username:password as the authentication ID.
• IP: uses the client host IP address for authentication.

Id: the identity checked during authentication; the form of the authentication information varies with the scheme.

Perm: indicates the permission that a user who passes ACL authentication can have for
ZooKeeper.
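For example, on a zkCli client (the znode path and IP address are illustrative):

setAcl /mynode ip:192.168.1.10:crwda
getAcl /mynode

Here crwda grants the create, read, write, delete, and admin permissions to clients from the given IP address, and getAcl displays the ACL that is now in effect.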

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 476
Log Enhancement

• An ephemeral node exists as long as the session that created it is active. Ephemeral node deletion is recorded in audit logs so that the status of ephemeral nodes can be traced.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 477
Commands for ZooKeeper Clients

• Invoke a ZooKeeper client: zkCli.sh -server 172.16.0.1:24002

• Create a node: create /node

• Obtain the subnodes of a node: ls /node

• Set node data: set /node data

• Obtain node data: get /node

• Delete a node: delete /node

• Delete a node and all subnodes: deleteall /node

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 478
01 02 03 04 05
Introduction Position of System Key Features Relationship with
to ZooKeeper ZooKeeper in Architecture Other Components
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 479
ZooKeeper and Streaming

ZooKeeper cluster

Streaming cluster

Active Nimbus Standby Nimbus

Supervisor Supervisor Supervisor



Worker Worker Worker Worker Worker Worker

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 480
ZooKeeper and HDFS

(Diagram: Both NameNodes interact with the ZooKeeper cluster. The NameNode that first creates the share directory becomes active, and the standby NameNode monitors the active NameNode's messages in the share directory; ZKFC performs the health monitoring that drives failover.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 481
ZooKeeper and YARN

(Diagram: The ResourceManager that first writes its election message to ZooKeeper becomes active and creates the Statestore directory; the standby ResourceManager monitors the active's election message and reads state from the Statestore directory during failover.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 482
ZooKeeper and HBase

(Diagram: The HMaster that first writes its message to the HMaster directory in ZooKeeper becomes active, and the standby HMaster monitors that directory; each RegionServer writes its own state message to ZooKeeper.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 483
Summary
This module describes the following information about ZooKeeper:
• Functions and position in FusionInsight.
• Service architecture and data models.
• Read and write processes as well as consistency.
• Creation and permission settings of ZooKeeper nodes.
• Relationship with other components.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 484
Quiz

• What are the functions and position of ZooKeeper in a cluster?

• Why does ZooKeeper need to be deployed on an odd number of nodes?

• What does ZooKeeper consistency mean?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 485
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 486
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


FusionInsight HD
Solution Overview

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.


Objectives
After completing this course, you will be able to understand:
A. Huawei big data solution FusionInsight HD
B. The features of FusionInsight HD
C. Success cases of FusionInsight HD
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 489
CONTENTS
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 490
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 491
Apache Hadoop - Prosperous Open - Source Ecosystem
Hadoop Ecosystem Map
(Diagram: around core Hadoop (HDFS, MapReduce) sit high-level interfaces and engines such as Hive, Pig, JAQL, Cascading, and Mahout; collection tools such as Nutch, Flume, and Scribe for unstructured data; Sqoop and hiho for structured data from RDBMS/OLTP systems; HBase as the column datastore; and monitoring and management tools such as Hue, Karmasphere, and Ganglia.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 492
Big Data Is an Important Pillar for Huawei ICT Strategy
(Left panel: Huawei Strategy Map, with the Big Data Analytics Platform at its core, spanning data center infrastructure, core network, IP+optical, enterprise network, FBB/MBB, things (M2M modules) and people (smart devices), with enterprise apps, SDP, BSS/OSS, content and app partners, third-party ISVs, and professional services above. Source: Huawei corporate presentation.)

(Right panel: global distribution of the Huawei big data R&D team.)
• There are 8 research centers with thousands of employees around the world.
• World-class data mining and artificial intelligence experts, such as PMC Committers and IEEE Fellows.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 493
FusionInsight HD: From Open - Source to Enterprise Versions

(Diagram: the evolution from the initial open-source version, through a prosperous community, to the enterprise version. The enterprise work covers version mapping, patch selection, baseline selection, performance optimization, and security configuration for components such as Hadoop, HBase, and logging, and adds security, reliability, and ease of use.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 494
FusionInsight Platform Architecture

(Architecture diagram, top to bottom:
• Industry applications: safe city, power, finance, telecom, and big data cloud services.
• Big data cloud services: data integration services (Data Ingest, DPS); data processing services (MapReduce Service, CloudTable); real-time computing services (Stream, RTD); data analysis services (DWS, MOLAP); machine learning services (MLS, log analysis); Artificial Intelligence Service (AIS: image tagging, NLP).
• DataFarm: FusionInsight Porter for data integration (Sqoop batch collection, Flume real-time collection); FusionInsight Miner for data insight (Weaver graphics analysis engine, Miner Studio mining platform); FusionInsight Farmer for data intelligence (RTD real-time decision engine, Farmer Base reasoning framework).
• FusionInsight HD data processing: Spark one-stop analysis framework, Storm/Flink stream processing framework, FusionInsight Elk standard SQL engine, ZooKeeper collaboration service, Kafka message queue, Yarn resource management, CarbonData new file format, HBase NoSQL database, Oozie job scheduling, and the HDFS distributed file system; FusionInsight LibrA parallel database.
• FusionInsight Manager management platform: security management, performance management, fault management, tenant management, and configuration management.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 495
Contribution to the Open - Source Community

(Capability ladder, bottom to top: be able to use the Apache open-source Hadoop community ecosystem; locate peripheral problems; be able to resolve kernel-level problems (outstanding individuals); be able to resolve kernel-level problems by teams; perform kernel-level development to support key service features; lead the community to complete future-oriented kernel-level feature development; create top community projects and be recognized by the ecosystem.)

Challenges: a large number of components and much code, frequent component updates, and the need for efficient feature integration.

• Outstanding product development and delivery capabilities and carrier-class operation support capabilities empowered by the
Hadoop kernel team.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 496
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 497
System and Data Reliability

System Reliability
• All components without SPOF.
• HA for all management nodes.
• Software and hardware health status monitoring.
• Network plane isolation.

Data Reliability
• Cross-data center DR.
• Third-party backup system integration.
• Key data power-off protection.
• Hot-swappable hard disks.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 498
Security

System Security
• Fully open-source component enhancement.
• Operating system security hardening.

Permission Authentication
• Authentication management of user permissions.
• User permission control for different components.

Data Security
• Data integrity verification.
• File data encryption.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 500
Network Security and Reliability - Dual - Plane
Networking

(Diagram: App-Servers communicate over the cluster service plane; the OMS-Server manages nodes over the cluster management plane; the Web UI client reaches the OMS server over the maintenance network outside the cluster.)

Network Type / Trustworthiness / Description:
• Cluster service plane: High. Carries the Hadoop cluster core components for the storage and transfer of service data.
• Cluster management plane: Medium. Only manages the cluster and carries no service data.
• Maintenance network outside the cluster: Low. Only the web services provided by the OMS server can be accessed.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 501
Visualized Cluster Management, Simplifying O&M

(Screenshot: the cluster management dashboard displays the health status of services and hosts as Good, Bad, or Unknown.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 502
Graphical Health Check Tool (1)

(Pie chart: check item pass rate 72%, check item failure rate 28%.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 503
Graphical Health Check Tool (2)

(Charts: qualification ratio of inspection items 88%, disqualification ratio 12%; node qualification rate 100%, node disqualification rate 0%.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 504
Easy Development
Native APIs of HBase:

try {
  table = new HTable(conf, TABLE);
  // 1. Generate RowKey.
  {......}
  // 2. Create Put instance.
  Put put = new Put(rowKey);
  // 3. Convert columns into qualifiers (need to consider merging cold columns).
  // 3.1 Add hot columns.
  {.......}
  put.add(COLUMN_FAMILY, Bytes.toBytes("QA"), hotCol);
  // 3.2 Merge cold columns.
  {.......}
  // 3.3 Add cold columns.
  put.add(COLUMN_FAMILY, Bytes.toBytes("QB"), coldCols);

Enhanced APIs:

try {
  table = new ClusterTable(conf, CLUSTER_TABLE);
  // 1. Create CTRow instance.
  CTRow row = new CTRow();
  // 2. Add columns.
  {........}
  // 3. Put into HBase.
  table.put(TABLE, row);
} catch (IOException e) {
  // Does not care about connection re-creation.
}

(Enhanced HBase SDK: recoverable connection manager, schema/table design tool, and data tool layered on the HBase API.)

The HBase table design tool, connection pool management function, and enhanced SDK are used to simplify the development of complex HBase data tables.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 505
FusionInsight Spark SQL

SQL compatibility:
• All 99 TPC-DS cases of the standard SQL:2003 are passed.

Data update and deletion:
• Spark SQL supports data insertion, update, and deletion when the CarbonData file format is used.

Large-scale Spark with stable and high performance:
• TPC-DS long-term stability is tested at the 100 TB data-volume scale.

Long-term stability test:
• Memory optimization: resolves memory leakage problems, decentralizes broadcasting, and optimizes Spark heap memory.
• Communication optimization: RPC enhancement, shuffle fetch optimization, and shuffle network configuration.
• Scheduling optimization: GetSplits() and AddPendingTask() acceleration, DAG serialization reuse.
• Extreme pressure test: 24/7 pressure test, HA test.
• O&M enhancement: log security review and DAG UI optimization.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 506
Spark SQL Multi - Tenant

JDBCServer (Proxy) Yarn

YarnQuery Tenant A
Spark JDBC
JDBC Proxy 1 Spark JDBCServer 1
Beeline
Spark JDBC Spark JDBCServer 2
Proxy 2
JDBC YarnQuery Tenant B
Beeline Spark JDBC
Proxy X
Spark JDBCServer 1

...
Spark JDBCServer 2

• The community's Spark JDBCServer supports only single tenants. A tenant is bound to a Yarn resource queue.
• FusionInsight Spark JDBCServer supports multiple tenants, and resources are isolated among
different tenants.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 507
Spark SQL Small File Optimization

1 MB+1 MB … 1 MB+1 MB … RDD2

coalesce coalesce coalesce coalesce coalesce coalesce

1 MB 1 MB 1 MB … 1 MB 1 MB 1 MB … RDD1

1 MB 1 MB 1 MB … 1 MB 1 MB 1 MB … HDFS

Text / Parquet / ORC / Json Table on HDFS

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 508
Apache CarbonData - Converging Data Formats
of Data Warehouse (1)
CarbonData: a single file format that meets the requirements of different access types: OLAP (multidimensional analysis), sequential access (large-scale scanning), and random access (small-range scanning).

(Stack diagram: SQL support through the Hive engine and Spark SQL; distributed execution through MapReduce, Spark, and Flink; storage formats ORC and Parquet (columnar storage) versus the CarbonData file (fully indexed, hybrid storage).)

Speedups measured with CarbonData:
• Random access (small-range scanning): 7.9 to 688 times.
• OLAP / interactive query: 20 to 33 times.
• Sequential access (large-scale scanning): 1.4 to 6 times.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 509
Apache CarbonData - Converging Data Formats
of Data Warehouse (2)
• Apache Incubator project since June 2016.
• Apache releases: 4 stable releases; the latest, 1.0.0, on Jan 28, 2017.
• Contributors: Alibaba Group, eBay, Huawei, Hulu, InMobi, Intel, LeTV, Meituan Waimai, Talend, Bank of Communications.
• In production: Bank of Communications, Huawei, Hulu.

(Stack diagram: compute engines Apache Spark, Flink, and Hive run on top of the CarbonData storage format, which sits on Hadoop.)

CarbonData supports IUD statements and provides data update and deletion capabilities in
big data scenarios. Pre-generated dictionaries and batch sort improve CarbonData import
efficiency while global sort improves query efficiency and concurrency.
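A sketch of the IUD capability in Spark SQL, assuming a CarbonData table named sales with the referenced columns already exists (the table and column names are illustrative):

-- Illustrative statements only.
INSERT INTO sales SELECT * FROM sales_staging;
UPDATE sales SET (price) = (price * 0.9) WHERE country = 'US';
DELETE FROM sales WHERE order_date < '2016-01-01';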

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 510
CarbonData Enhancement

(Diagram: GUIs and other analysis tools connect through the Thrift Server to Spark SQL on Spark, which reads the CARBON file format as well as other data sources.)

• Quick query response: CarbonData features high-performance queries; its query speed is ten times that of Spark SQL. Its dedicated data format is designed around high-performance queries, including multiple index technologies, global dictionary encoding, and multiple push-down optimizations, so it responds quickly to TB-level data queries.
• Efficient data compression: CarbonData compresses data by combining lightweight and heavyweight compression algorithms, saving 60% to 80% of storage space and significantly reducing hardware storage costs.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 511
Flink - Distributed Real - Time
Processing System

Flink is a distributed real-time processing system with low latency (measured in milliseconds), high throughput, and high reliability, promoted by Huawei in the IT field and integrated into FusionInsight HD.

Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink specializes in stream processing and is one of the top open-source stream processing engines in the industry. It is suitable for low-latency data processing scenarios, providing highly concurrent pipelined data processing, millisecond-level latency, and high reliability.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 512
Visible HBase Modeling

(Diagram: a user list is mapped to an HBase table. A column family is a collection of columns that have service association relationships. Each column of the service data represents an attribute and is mapped to an HBase column qualifier, where each column value is stored as a KeyValue. The row key can be composed from columns, for example Reverse(Column1, 4) + Column2 + Column3.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 513
HBase Cold Field Merging Transparent to
Applications
User Data

ID Name Phone ColA ColB ColC ColD ColE ColF ColG ColH

A B C D
HBase KeyValues

Problems
• High expansion rate and poor data query performance due to the HBase column increase.
• Increased development complexity and metadata maintenance due to the application layer
merging cold data columns.

Features
• Cold field merging transparent to applications.
• Real-time write and batch import interfaces.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 514
Hive / HBase Fine - Grained Encryption

Application scenarios
• Data saved in plaintext mode may cause security risks of sensitive-data leakage.

Solution
• Hive: encryption of tables and columns.
• HBase: encryption of tables, column families, and columns.
• Encryption algorithms AES and SM4, plus user-defined encryption algorithms.

(Diagram: sensitive data written to Hive/HBase is encrypted before being stored in HDFS as ciphertext and decrypted on read; insensitive data is stored as-is.)

Customer benefits
• Sensitive data is encrypted and stored by table or column.
• Algorithm diversity and system security.
• Encryption and decryption are transparent to services.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 515
HBase Secondary Indexing

(Diagram: without an index, a query on UserTable is a "Scan+Filter" over a large scanning area; with a secondary index, the index table UserTable_idx, whose row keys are built from the indexed column values, locates the destination row in the data table.)

• No index: "Scan+Filter" scans a large amount of data.
• Secondary index: the target data can be located after two I/Os.

• Index Region and Data Region as companions under a unified processing mechanism.
• Original HBase API interfaces, user-friendly.
• Coprocessor-based plug-ins, easy to upgrade.
• Write optimization, supporting real-time write.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 516
CTBase Simplifies HBase Multi - Table Service Development

(Example: a CTBase clustered table stores an AccountInfo record (account ID, account name, account balance, for example A0001 / Andy / $100232) together with its Transaction records (amount and time, for example $100 at 12/12/2014 18:00:02), so that an account and all of its transactions live in one HBase table.)

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 517
HFS Small File Storage and Retrieval Engine

Application scenario
• A large number of small files and the associated description information need to be stored.

Current problem
• A large number of small files stored in the Hadoop Distributed File System (HDFS) put great pressure on the NameNode, and storing many small files in HBase wastes I/O resources during compaction.

HFS solution value
• The HBase FileStream (HFS) stores not only small files but also the metadata describing those files.
• The HFS provides a unified and friendly access API.
• The HFS selects the optimal storage option based on file size: small files are stored in HBase Medium-sized Object (MOB) HFiles, and medium and large files are stored directly in HDFS.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 518
Label - based Storage
I/O conflicts affect online services. With label-based storage, the data of online applications is stored only on nodes labeled with "Online Application" and is isolated from the data of offline applications. This design prevents I/O competition and improves the local hit ratio.

(Diagram: with HDFS common storage, online applications and batch applications mix their data across all nodes; with HDFS label-based storage, online-application data is stored only on the nodes labeled for online applications, and batch data on the remaining nodes.)

• Solution description: Label cluster nodes based on applications or physical characteristics, for example, label a
node with “Online Application.” Then application data is stored only on nodes with specified labels.
• Application scenarios:
1. Online and offline applications share a cluster.
2. Specific services (such as online applications) run on specific nodes.
• Customer benefits:
1. I/Os of different applications are isolated to ensure the application SLA.
2. The system performance is improved by improving the hit ratio of application data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 519
Label - based Scheduling

(Diagram: with common scheduling, Spark and MapReduce applications run on arbitrary nodes; with label-based scheduling, each application runs only on nodes carrying its label, for example Spark applications on "large memory" nodes and MapReduce applications on "default" nodes.)

Fine-grained scheduling based on application awareness, improving resource utilization


• Different applications such as online and batch processing are running on nodes with their specific labels to
absolutely isolate computing resources of different applications and improve service SLA.
• Applications that have special requirements on node hardware are running only on nodes with special
hardware, for example, Spark applications need to run on nodes with large memory. Resources are scheduled
on demand, improving resource utilization and system performance.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 520
CPU Resource Configuration Period Adjustment
Batch processing application Real-time application Batch processing application Real-time application

Hive / Spark / … Hive / Spark / …


HBase HBase
QA QB QC QD QA QB QC QD

CPU CPU

Cgroup1 40% Cgroup2 60% Cgroup1 80% Cgroup2 20%

7:00 Time
20:00

• Solution description: Different services are given different proportions of resources in different time segments. For example, from 7:00 to 20:00, when real-time services are at peak hours, they can be allocated 60% of the resources; from 20:00 to 7:00, when real-time services are at off-peak hours, 80% of the resources can be allocated to the batch processing applications.
• Application scenario: The peak hours and off-peak hours of different services are different.
• Customer benefit: Services can obtain as many resources as possible at peak hours, boosting the average resource
utilization of the system.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 521
Resource Distribution Monitoring
(Chart: remaining HDFS capacity of the cluster over time, in GB.)

Benefits
• Quick focusing on the most critical resource consumption.
• Quick locating of the node with the highest resource consumption to take appropriate measures.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 522
Dynamic Adjustment of the Log Level

• Application scenario: When a fault occurs in the Hadoop cluster, locating it quickly usually requires raising the log level. Traditionally, a log level change only takes effect after the process is restarted, which interrupts services. How can this problem be resolved?
• Solution: Dynamically adjust the log level on the Web UI.
• Benefits: When locating a fault, you can quickly change the log level of a specified service or node without restarting the process or interrupting services.
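Stock Hadoop daemons already expose a /logLevel HTTP endpoint (the same mechanism behind the `hadoop daemonlog` command), and the FusionInsight Web UI builds on this kind of capability. A hedged sketch against that standard servlet; the host, port, and logger name are examples:

# Change a logger's level on a running Hadoop daemon without restart,
# using the standard /logLevel servlet.
import requests

DAEMON = "http://datanode-1:9864"   # default DataNode HTTP port in Hadoop 3.x
LOGGER = "org.apache.hadoop.hdfs.server.datanode.DataNode"

def set_log_level(level):
    r = requests.get(f"{DAEMON}/logLevel", params={"log": LOGGER, "level": level})
    r.raise_for_status()

set_log_level("DEBUG")   # raise verbosity while locating the fault
# ... reproduce and capture the problem ...
set_log_level("INFO")    # restore the normal level afterwards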

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 523
Wizard-based Cluster Data Backup

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 524
Wizard-based Cluster Data Restoration

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 525
Multi-Tenant Management

Multi-level tenant management (a capacity-invariant sketch follows below):
• Organization mapping: Company → enterprise tenant; Dept. A → Tenant A; Sub-department A_1 → Tenant A_1.
• Computing resources: Yarn queues (CPU / memory / I/O).
• Storage resources: HDFS (storage space / file overview).
• Service resources: HBase, ...

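Each tenant level ultimately maps to a Yarn queue and an HDFS quota. As a sketch of the invariant that multi-level tenancy must maintain, the hypothetical check below verifies that sub-tenants never claim more than 100% of their parent's capacity:

# Hypothetical tenant tree: each tenant maps to a Yarn queue whose
# capacity is a percentage of its parent's resources.
TENANTS = {
    "Company":   {"capacity": 100, "children": ["TenantA"]},
    "TenantA":   {"capacity": 60,  "children": ["TenantA_1"]},
    "TenantA_1": {"capacity": 50,  "children": []},
}

def check(tenant):
    """Children of a tenant must not request more than 100% of it."""
    kids = TENANTS[tenant]["children"]
    used = sum(TENANTS[k]["capacity"] for k in kids)
    assert used <= 100, f"{tenant}: children claim {used}%"
    for k in kids:
        check(k)

check("Company")  # passes: TenantA takes 60% of Company, TenantA_1 takes 50% of TenantA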
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 526
One-Stop Tenant Management

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 527
Visualized, Centralized User Rights Management
Visualized, centralized user rights management is easy to use, flexible, and refined (a schematic model follows below):
• Easy to use: visualized, unified user rights management across multiple components.
• Flexible: role-based access control (RBAC) with predefined privilege sets (roles) that can be reused.
• Refined: multi-level (database / table / column-level) and fine-grained (Select / Delete / Update / Insert / Grant) authorization.
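The model reduces to user → roles → privileges scoped to databases, tables, and columns. A schematic Python model; all names and privilege sets here are hypothetical:

# Minimal RBAC model: roles bundle fine-grained privileges on objects.
ROLES = {
    "analyst": {("db1.sales", "Select"), ("db1.sales.amount", "Select")},
    "etl":     {("db1.sales", "Insert"), ("db1.sales", "Delete")},
}
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}

def allowed(user, obj, action):
    return any((obj, action) in ROLES[r] for r in USER_ROLES.get(user, ()))

print(allowed("alice", "db1.sales", "Select"))  # True
print(allowed("alice", "db1.sales", "Insert"))  # False: not granted to her roles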

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 528
Automatic NTP Configuration

[Figure: the active management node runs an NTP server that synchronizes, as a client, with an external NTP server; the standby management node, the control node, and all DataNodes run NTP clients that synchronize with the active management node.]

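What the installer automates amounts to writing a role-appropriate NTP configuration on every node. A minimal sketch of that logic; the addresses are placeholders, and `server ... iburst` is standard ntp.conf syntax:

# Emit the NTP directive each node needs, mirroring the hierarchy above.
EXTERNAL_NTP = "ntp.example.com"   # placeholder external time source
ACTIVE_MGMT = "192.0.2.10"         # placeholder active management node

def ntp_conf(role):
    if role == "active_mgmt":      # syncs with the external source
        return f"server {EXTERNAL_NTP} iburst\n"
    # standby management, control, and data nodes all follow the
    # active management node
    return f"server {ACTIVE_MGMT} iburst\n"

for role in ("active_mgmt", "standby_mgmt", "control", "datanode"):
    print(role, "->", ntp_conf(role).strip())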
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 529
Automatically Configuring Mapping of Hosts

Benefits
• Shortens the environment preparation time for installing the Hadoop cluster.
• Reduces the probability of user configuration errors (see the sketch below).
• Reduces the risk that manually configuring host mappings disturbs stably running nodes after capacity expansion in a large-scale cluster.
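Automating the mapping is essentially generating one consistent set of /etc/hosts entries from the node inventory, validating it, and distributing it to every host. A minimal sketch; the inventory is hypothetical:

# Build /etc/hosts entries from a node inventory (hypothetical data);
# the installer distributes the same block to every cluster node.
INVENTORY = {
    "192.0.2.11": "master-1",
    "192.0.2.12": "master-2",
    "192.0.2.21": "worker-1",
}

# Hostnames must be unique, or services resolve the wrong peer.
assert len(set(INVENTORY.values())) == len(INVENTORY), "duplicate hostname"

hosts_block = "\n".join(f"{ip}\t{name}" for ip, name in INVENTORY.items())
print(hosts_block)   # the same block is appended to /etc/hosts on each node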

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 530
Rolling Restart / Upgrade / Patch

HDFS rolling upgrade example: modifying a configuration, performing the upgrade, and installing a patch can all be carried out in rolling mode. Service interruption duration of core components: none, even over an upgrade window of about 12 hours.

[Figure: an HDFS cluster with two NameNodes (HA) and five DataNodes is upgraded node by node from C60 to C70 while clients keep working. Components that support rolling operations include ZooKeeper, HDFS, Yarn, HBase, Storm, Flume, Loader, Spark, Hive, and Solr.]

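The rolling pattern behind all three operations is the same: take one worker at a time, act on it, and wait until the cluster reports healthy before moving on, so replicas on the remaining DataNodes keep serving I/O. A skeleton with hypothetical helper functions; real orchestration goes through FusionInsight Manager:

# Rolling restart skeleton: one DataNode at a time, gated on health.
import time

DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # example hosts

def restart(node):        # hypothetical: would call the manager's API
    print(f"restarting {node}")

def cluster_healthy():    # hypothetical: e.g. check for missing blocks
    return True

for node in DATANODES:
    restart(node)
    while not cluster_healthy():   # do not touch the next node until
        time.sleep(10)             # replication has caught up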
Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 531
01 02 03
FusionInsight Overview FusionInsight Features Success Cases of
FusionInsight

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 532
Huawei Smart Transportation Solution

Secure:
• Challenges to key vehicle identification: insufficient automatic identification capability for key vehicles.
• Insufficient traffic accident detection capability: blind spots, weak detection technology, and manual accident reporting and handling.
• Low efficiency of special crackdowns: fragmented information and a weak crackdown platform.

Organized:
• Challenges to checkpoint and e-police capabilities: rigid algorithms.
• Challenges to violation review and handling capabilities: heavy workload.
• Challenges to crackdown data analysis capabilities: manual analysis taking 7-30 days.

Smooth:
• Challenges to traffic detection capability: faulty detection devices, low detection efficiency, and unreliable detection results.
• Challenges to traffic analysis capabilities: traffic information is not shared among cities.
• Challenges to traffic signal optimization.

Intelligent:
• Computing intelligence challenges: closed systems and technologies, and fragmented information.
• Perceptual intelligence challenges: weak perception of traffic, events, and violations.
• Cognitive intelligence challenges: lack of traffic awareness at regions and intersections.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 533
Traffic Awareness in the Whole City: Deep Learning and
Digital Transformation
• No cameras are added: through deep learning and intelligent analysis, about 50 billion real-time road traffic parameters are generated every month, laying the foundation for the digital transformation of traffic management.

[Figure: applications (vehicle traffic and event awareness, traffic flow analysis, traffic accident perception and analysis, traffic signal optimization) sit on a deep learning platform (algorithm warehouse with deep learning training, reasoning, and search engines) and a video cloud storage and cloud computing platform (traffic big data crackdown modeling engine, time and space analysis engine), fed by more than 3000 channels of HD e-police, monitoring of more than 6000 roads, and more than 4000 traffic checkpoints.]
Note: The preceding figures use one city as an example.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 534
Traffic Big Data Analysis Platform

[Figure: four analysis applications around a national integrated transportation command platform:
• Key vehicle traffic analysis: 400 million vehicles + 12.6 billion pass records.
• Key vehicle violation analysis: 400 million vehicles + 2.6 billion violation records.
• Detection replacement analysis: 400 million vehicles + 2.6 billion violation records + 1.1 billion detection records, completed in 20 minutes.
• Buying and selling analysis: 400 million vehicles + 2.6 billion violation records + 110 million drivers who cleared license points.]

Serving 400 million vehicles across provinces and cities in China, the traffic big data analysis platform analyzes 2.6 billion violation records and 12.6 billion pass records, greatly improving the security and order management capability of cross-province traffic and reaching a world-leading level.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 535
Limitations of Traditional Marketing Systems

• Low accuracy: customer groups are obtained through data collection and filtering, which is time-consuming and labor-intensive, so precise sales cannot be implemented.
• Non-real-time: advertisements can be pushed only according to preset rules; real-time marketing triggered by events or locations cannot be implemented.
• Limited data support: mainly structured data, unable to handle semi-structured data; customer behavior involved in rule operation and configuration has a low support rate.
• Fixed rules: marketing strategies and rules are fixed; new rules need to be developed and implemented before use.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 536
Marketing System Architecture

[Figure: marketing system architecture:
• Application layer: marketing execution, marketing analysis, statistical analysis, scheduling monitoring, marketing plan, ...
• Model layer: marketing model, event detection model, rule engine, recommendation engine.
• Middleware layer: Chinasoft big data middleware (Ark).
• Platform layer: Huawei enterprise-class big data platform (FusionInsight) with a real-time stream processing component (Flume, Storm / Flink, Kafka, Redis, ZooKeeper), an offline processing component (Spark, Loader, Hive, HBase, MapReduce, HDFS / Yarn, ZooKeeper, MPPDB), FusionInsight Farmer RTD (Farmer, RTD, MQ, Redis), and Manager.
• Infrastructure / cloud platform: x86 servers, network devices, security devices.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 537
Big Data Analysis, Mining, and Machine Learning Make
Marketing More Accurate

[Figure: closed-loop marketing flow: data analysis → predictive modeling → model application → model effect monitoring and evaluation. Data sources and customer data feed customer group filtering and correlation analysis; a marketing activity plan is executed over multiple channels (SMS, app, Twitter); analysis reports support effect evaluation and continuous optimization. Model effect evaluation, customer data updates, and model improvement close the loop.]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 538
Solution Benefits
• Precise: precise customer group mining — customer-based 360-degree view; customer type-based mining.
• Easy to use: self-learning of rules — customizable variables, rules, and rule modes; automatic rule learning and optimization.
• Comprehensive: support for various types of data — structured, unstructured, and semi-structured data; multi-channel comprehensive analysis; statistical analysis.
• Reliable: uninterrupted services — always-on service.
• Real-time: real-time marketing information push — event-based; location-based; millisecond-level analysis based on full data.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 539
A Carrier: Big Data Convergence to Achieve Big Values

[Figure: upper-layer applications (crowd gathering, credit investigation, service experience quality computing, Internet access log query, signaling log query, domain name query log query, ...) run on a real-time query platform (HBase with a KV interface) and a basic analysis platform (Hive, MapReduce, Spark with SQL and Spark SQL interfaces), both built on a shared Hadoop resource pool (HDFS, Yarn / ZooKeeper, Manager). ETL loads the platform from traditional data sources (BOM) and new data sources (Internet).]

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 540
Philippine PLDT: Converting and Archiving Massive
CDRs
[Figure: reports, interactive analysis, forecast analysis, and text mining (CSP) run over a data federation layer on top of a DWH (aggregation) and Hadoop (archiving, CSSD). Source files are periodically obtained from a transit server, converted to the T0 / T1 format, and uploaded to the CSSD / DWH servers. Data sources include structured data (SUN, NSN E///, PLP ODS, AURA, ...) and unstructured data (mobile Internet, social media, voice to text).]

Hadoop stores the original CDRs together with structured and unstructured data, improving storage capacity and processing performance while reducing hardware costs.
A total of 1.1 billion records (664,300 MB) were extracted, converted, and loaded at an overall speed of 113 MB/s, far above the 11 MB/s the customer expected; at that rate the full data set loads in roughly 100 minutes instead of more than 16 hours.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 541
Summary
These slides describe Huawei's enterprise-class big data platform FusionInsight HD, focusing on its features and application scenarios, and present FusionInsight HD success cases in the industry.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 542
Quiz

• What are the features of FusionInsight HD?


• Which encryption algorithms are supported by Hive / HBase fine-grained encryption?
• A large number of small files stored in Hadoop HDFS puts great pressure on the NameNode, and storing many small files in HBase wastes I/O resources during compaction. What technical solutions address this problem?
• Which log levels can be dynamically adjusted?

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 543
Quiz

True or False
• Hive supports encryption of tables and columns. HBase supports encryption of tables,
column families, and columns. (T or F).
• User rights management is role-based access control and provides visualized and unified
user rights management for multiple components. (T or F).

Multiple-Answer Question
• Which of the following indicate the high reliability of FusionInsight HD? ( )
A. All components are free of SPOFs.
B. All management nodes support HA.
C. Health status monitoring for the software and hardware.
D. Network plane isolation.

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 544
More Information

• Training materials:
– http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796

• Exam outline:
– http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797

• Mock exam:
– http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798

• Authentication process:
– http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved. Page 545
THANK YOU!

Copyright ©2019 Huawei Technologies Co., Ltd. All rights reserved.
