
Hadoop Ecosystem: Technology Study, Architecture and Analysis

Pooja D. Pandit, Mumbai University, India

Technical Report, September 2021. DOI: 10.13140/RG.2.2.26216.39683

2021. Big Data Analytics.

Fig. 1: The Hadoop Ecosystem Overview


Abstract—In this paper, we study the Hadoop Ecosystem. Specifically, we first present the overall Hadoop architecture. We study the various components of the Hadoop Ecosystem such as the Hadoop Distributed File System, MapReduce, YARN, etc. Next, we study the properties of the HDFS and YARN MapReduce systems.

I. INTRODUCTION

Recent times have seen an exponential growth in the amount of data that is being generated by various applications. This data is being used by a number of institutions to generate useful trends and to understand user behavior. User data not only reflects what a user is currently thinking about but is also a good indicator of how a user thinks. This data also exposes the trends necessary for understanding the design of next-generation products. Further, researchers have shown a number of use cases for WLAN networks for data accumulation and processing [1]–[6]. Moreover, with new advancements in wireless networks, the data generated by users is bound to increase [7]–[12].

In order to process such a high amount of generated data, massive computation systems are required. Hadoop is a framework that enables distributed storage and processing of massive amounts of data. It has been used in a number of use cases. For instance, more than half of the Fortune 50 companies claimed that they used Hadoop.

In this paper, we study the Hadoop Ecosystem in depth.

II. HADOOP SYSTEM

This section describes the various components of the Hadoop Ecosystem. The Hadoop Ecosystem is illustrated in Fig. 1.

A. HDFS (Hadoop Distributed File System)

HDFS is used to manage big data sets with high Volume, Velocity and Variety. HDFS implements a master-slave architecture: the master is the Name node and the slaves are the Data nodes. Its features are as follows:

• Scalable
• Reliable
• Commodity hardware

Fig. 2: The Hadoop Distributed File System Architecture

The HDFS architecture is as shown in Fig. 2. The two key entities are the name node and the data node, which are described as follows.

• Name node: The HDFS namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by inodes. Inodes record attributes like permissions, modification and access times, namespace and disk space quotas. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes.
• Data node: Each block replica on a DataNode is represented by two files in the local native filesystem. The first file contains the data itself and the second file records the block's metadata, including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not require extra space to round it up to the nominal block size as in traditional filesystems. Thus, if a block is half full it needs only half of the space of a full block on the local drive.
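To make the interaction with the name node and data nodes concrete, the following minimal Java sketch uses the standard Hadoop FileSystem API to write a file into HDFS and read it back. The NameNode URI, path, and file contents are hypothetical placeholders, not values taken from this report.

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");

        // The client writes through an output stream; HDFS splits the data into
        // blocks and the NameNode decides which DataNodes store each replica.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Reads fetch block locations from the NameNode and stream the data from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}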
B. MapReduce

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.

Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

• Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

MapReduce has two daemons:

• Job Tracker: Schedules jobs and tracks the assigned jobs on the Task Tracker.
• Task Tracker: Tracks the tasks and reports status to the Job Tracker.
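The classic word-count sketch below illustrates the map and reduce stages using the Hadoop MapReduce Java API: the mapper emits (word, 1) tuples and, after the shuffle groups them by word, the reducer sums the counts. The class names are illustrative and this is a minimal sketch rather than a complete job.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input line is split into words and emitted as (word, 1) tuples.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: after the shuffle groups tuples by key, the counts for each word are summed.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}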
C. YARN (Yet Another Resource Negotiator)

YARN is also called MapReduce 2 (MRv2). The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

• The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.
• The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
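As a sketch of how an MRv2 job is configured and handed to YARN, the driver below uses the standard org.apache.hadoop.mapreduce.Job API to submit the word-count classes from the previous sketch. When submitted to a YARN cluster, the ResourceManager starts an ApplicationMaster for the job, which then negotiates containers for the map and reduce tasks. The input and output paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths for the job input and output.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // On an MRv2 cluster this submits the job to YARN: the ResourceManager
        // launches an ApplicationMaster, which requests containers for the tasks.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}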
D. Data Access

1) Pig: Apache Pig is a high-level language built on top of MapReduce for analyzing large datasets with simple ad hoc data analysis programs. Pig is also known as a Data Flow language. It is very well integrated with Python. It was developed by Yahoo. It is a high-level scripting language that is used with Apache Hadoop. The salient features of Pig are as follows.

• Ease of programming: Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs accomplish huge tasks, but they are easy to write and maintain.
• Optimization opportunities: Because the system automatically optimizes the execution of Pig jobs, the user can focus on semantics.
• Extensibility: Pig users can create custom functions to meet their particular processing requirements.

2) Hive: Apache Hive is another high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis of structured data. It was originally developed by Facebook and later made open source. Hive provides a database query interface to Apache Hadoop. The Hive architecture is shown in Fig. 3.

Fig. 3: Hive Architecture (User Interfaces: Web UI, Command Line, HD Insight; HiveQL Process Engine; Meta Store; Execution Engine; MapReduce; HDFS or HBase data storage)

The description of the various components of the Hive architecture is as follows.

• User Interface: Hive is a data warehouse infrastructure software that can create interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
• Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
• HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
• Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
• HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data into the file system.
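To illustrate the query path through the Hive components described above, the following sketch connects to a hypothetical HiveServer2 endpoint through the Hive JDBC driver and runs a HiveQL query. The connection URL, credentials, table, and query are made up for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver shipped with Hive (hive-jdbc).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database.
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "demo", "");
             Statement stmt = conn.createStatement()) {

            // The HiveQL query is compiled by the HiveQL Process Engine and
            // run by the execution engine, typically as MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}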
E. Data Storage

1) HBase: Apache HBase is an open-source NoSQL database that provides real-time read/write access to large datasets. HBase scales linearly to handle huge data sets with billions of rows and millions of columns, and it easily combines data sources that use a wide variety of different structures and schemas. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN. HBase can be used for storing semi-structured data, like log data, and then providing that data very quickly to users or applications integrated with HBase. Various characteristics and benefits of the HBase system are as follows.

• Fault tolerance: Replication across the data center. Atomic and strongly consistent row-level operations. High availability through automatic failover. Automatic sharding and load balancing of tables.
• Fast: Near real-time lookups. In-memory caching via block cache and Bloom filters. Server-side processing via filters and co-processors.
• Usable: The data model accommodates a wide range of use cases. Metrics export via File and Ganglia plugins. Easy Java API as well as Thrift and REST gateway APIs.
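A minimal sketch of the HBase Java client API, storing and reading back a row of semi-structured log data of the kind mentioned above. The table name, column family, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table "web_logs" with a column family "info".
             Table table = connection.getTable(TableName.valueOf("web_logs"))) {

            // Write one row: the row key and columns are arbitrary byte arrays.
            Put put = new Put(Bytes.toBytes("2021-09-06#host01"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("level"), Bytes.toBytes("WARN"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("msg"), Bytes.toBytes("disk nearly full"));
            table.put(put);

            // Random read of the same row by key.
            Result result = table.get(new Get(Bytes.toBytes("2021-09-06#host01")));
            byte[] level = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("level"));
            System.out.println("level = " + Bytes.toString(level));
        }
    }
}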
F. Data Intelligence

1) Mahout: Mahout is a library of scalable machine-learning algorithms. Machine learning is a discipline of artificial intelligence focused on enabling machines to learn without being explicitly programmed, and it is commonly used to improve future performance based on previous outcomes. Mahout is implemented on top of Apache Hadoop. Mahout provides the data science tools to automatically find meaningful patterns in big data sets. The Apache Mahout project aims to make it faster and easier to turn big data into big information.

G. Data Integration

1) Apache Sqoop: Apache Sqoop is a tool designed for bulk data transfers between relational databases and Hadoop. Its features are as follows.

• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import and export to HBase.

2) Apache Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications. The architecture is shown in Fig. 4.

Fig. 4: Apache Flume Architecture

H. Management, Monitoring and Orchestration

1) Apache Ambari: Ambari was created to help manage Hadoop. It offers support for many of the tools in the Hadoop ecosystem, including Hive, HBase, Pig, Sqoop and ZooKeeper. The tool features a management dashboard that keeps track of cluster health and can help diagnose performance issues.

2) Apache ZooKeeper: Writing distributed applications is difficult because partial failures may occur between nodes. To overcome this, Apache ZooKeeper has been developed as an open-source server which enables highly reliable distributed coordination. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In case of any partial failure, clients can connect to any node and be assured that they will receive the correct, up-to-date information. A minimal client sketch is given at the end of this subsection.

3) Apache Oozie: Oozie is a workflow scheduler system to manage Hadoop jobs. It is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java web application that runs in a Java servlet container.
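The minimal sketch below, referenced in the ZooKeeper description above, uses the ZooKeeper Java client to publish and read a small piece of shared configuration. The ensemble address, znode path, and stored value are hypothetical.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Hypothetical ZooKeeper ensemble address; wait until the session is established.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of shared configuration as a persistent znode.
        byte[] value = "replication=3".getBytes(StandardCharsets.UTF_8);
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData("/demo-config", value, -1);   // -1 skips the version check
        }

        // Any client in the cluster can read the same, up-to-date value.
        byte[] stored = zk.getData("/demo-config", false, null);
        System.out.println(new String(stored, StandardCharsets.UTF_8));

        zk.close();
    }
}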
I. R-connections

Oracle R Connector for Hadoop is a collection of R packages that provide:

• Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables.
• Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data in HDFS files.

III. PROPERTIES OF HADOOP DISTRIBUTED FILE SYSTEM

The properties of the HDFS system are as follows:

• Fault tolerance by detecting faults and applying quick, automatic recovery.
• Portability across heterogeneous commodity hardware and operating systems.
• Scalability to reliably store and process large amounts of data.
• Data storage reliability by automatically maintaining multiple copies of data and automatically redeploying processing logic in the event of failure.
• Staging to commit: When a client creates a file in HDFS, it first caches the data into a temporary local file. It then redirects subsequent writes to the temporary file. When the temporary file accumulates enough data to fill an HDFS block, the client reports this to the name node, which converts the file to a permanent data node.
• Data block rebalancing: HDFS data blocks might not always be placed uniformly across data nodes, meaning that the used space for one or more data nodes can be underutilized. Therefore, HDFS supports rebalancing data blocks using various models. One model might move data blocks from one data node to another automatically if the free space on a data node falls too low. Another model might dynamically create additional replicas and rebalance other data blocks in a cluster if a sudden increase in demand for a given file occurs. HDFS also provides the hdfs balancer command for manual rebalancing tasks.
• Data integrity: HDFS goes to great lengths to ensure the integrity of data across clusters. It uses checksum validation on the contents of HDFS files by storing computed checksums in separate, hidden files in the same namespace as the actual data. When a client retrieves file data, it can verify that the data received matches the checksum stored in the associated file.
• HDFS permissions for users, files, and directories: HDFS implements a permissions model for files and directories that has a lot in common with the Portable Operating System Interface (POSIX) model; for example, every file and directory is associated with an owner and a group. The HDFS permissions model supports read (r), write (w), and execute (x).
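To illustrate the POSIX-like permissions model, the sketch below uses the Hadoop FileSystem API to set the mode, owner, and group of a directory. The paths, user, and group names are hypothetical, and changing ownership normally requires HDFS superuser privileges.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/user/demo/reports");   // hypothetical directory

        fs.mkdirs(dir);

        // rwx for the owner, r-x for the group, no access for others (mode 750).
        fs.setPermission(dir, new FsPermission(FsAction.ALL, FsAction.READ_EXECUTE, FsAction.NONE));

        // Owner and group are plain strings, as in the POSIX model
        // (setOwner typically requires HDFS superuser privileges).
        fs.setOwner(dir, "demo", "analysts");

        FileStatus status = fs.getFileStatus(dir);
        System.out.println(status.getOwner() + ":" + status.getGroup()
                + " " + status.getPermission());
    }
}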

IV. PROPERTIES OF YARN MAPREDUCE FRAMEWORK

The properties of the YARN MapReduce framework are as follows.

• Uberization: the possibility to run all tasks of a MapReduce job in the ApplicationMaster's JVM if the job is small enough. This way, we avoid the overhead of requesting containers from the ResourceManager and asking the NodeManagers to start small tasks.
• Binary or source compatibility for MapReduce jobs written for MRv1 (MAPREDUCE-5108).
• High availability for the ResourceManager (YARN-149). If the ResourceManager is restarted, it recreates the state; this has already been done by some vendors.
• The ResourceManager stores information about running applications and completed tasks, and re-runs only incomplete tasks. This work is close to completion and has been actively tested by the community.
• Simplified user-log management and access: Logs generated by applications are not left on individual slave nodes (as with MRv1) but are moved to central storage, such as HDFS. Later, they can be used for debugging purposes or for historical analyses to discover performance issues.
• A new look and feel of the web interface.

V. LANGUAGES FOR DATA ACCESS FOR HDFS

The two languages used to access data from HDFS are Pig and JAQL.

A. Pig

Pig deals with structured data using Pig Latin, which is a scripting language. It was developed by Yahoo. The data structure that it operates on is complex and nested. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
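As a sketch of how Pig Latin statements can be driven from Java, the example below uses Pig's PigServer API to load, group, and aggregate a hypothetical log file and store the result back into HDFS. The file paths, field names, and script are made up for the example.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigJobExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin statements from Java; MAPREDUCE mode executes them on the cluster
        // (ExecType.LOCAL can be used for testing on the local filesystem).
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Hypothetical tab-separated log file in HDFS.
        pig.registerQuery("logs = LOAD '/user/demo/web_logs' AS (page:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY page;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group AS page, SUM(logs.bytes) AS total_bytes;");

        // An alias is only materialized when it is stored or iterated,
        // at which point Pig compiles the data flow into MapReduce jobs.
        pig.store("totals", "/user/demo/bytes_per_page");
        pig.shutdown();
    }
}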

B. JAQL (Query Language for JSON)

JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. It was developed by IBM. The data structure that it operates on is JSON (JavaScript Object Notation), which is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive.

VI. CONCLUSION

In this paper, we study the Hadoop Ecosystem. Specifically, we first present the overall Hadoop architecture. We study the various components of the Hadoop Ecosystem such as the Hadoop Distributed File System, MapReduce, YARN, etc. Next, we study the properties of the HDFS and YARN MapReduce systems.

REFERENCES

[1] Peshal Nayak and Edward W. Knightly. uScope: a tool for network managers to validate delay-based SLAs. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 171–180, 2021.
[2] Peshal Nayak, Santosh Pandey, and Edward W. Knightly. Virtual speed test: an AP tool for passive analysis of wireless LANs. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 2305–2313. IEEE, 2019.
[3] P. Nayak, M. Garetto, and E. W. Knightly. Multi-user downlink with single-user uplink can starve TCP. In IEEE INFOCOM. IEEE, 2017.
[4] P. Nayak. AP-side WLAN Analytics. PhD thesis, Rice University, 2019.
[5] P. Nayak, M. Garetto, and E. W. Knightly. Modeling Multi-User WLANs Under Closed-Loop Traffic. IEEE/ACM Transactions on Networking, 2019.
[6] P. Nayak. Performance Evaluation of MU-MIMO WLANs Under the Impact of Traffic Dynamics. Master's thesis, 2016.
[7] P. B. Nayak, S. Verma, and P. Kumar. Multiband fractal antenna design for cognitive radio applications. In Proc. of ICSC. IEEE, 2013.
[8] P. B. Nayak, S. Verma, and P. Kumar. A novel compact tri-band antenna design for WiMAX, WLAN and Bluetooth applications. In Proc. of NCC. IEEE, 2014.
[9] P. B. Nayak, R. Endluri, S. Verma, and P. Kumar. Compact dual-band antenna for WLAN applications. In Proc. of PIMRC. IEEE, 2013.
[10] P. B. Nayak, S. Verma, and P. Kumar. Ultrawideband (UWB) antenna design for cognitive radio. In Proc. of CODEC. IEEE, 2012.
[11] R. Endluri, P. B. Nayak, and P. Kumar. A Low Cost Dual Band Antenna for Bluetooth, 2.3 GHz WiMAX and 2.4/5.2/5.8 GHz WLAN. International Journal of Computer Applications.
[12] Peshal Nayak. Performance evaluation of MU-MIMO under the impact of open loop traffic dynamics. arXiv preprint arXiv:2108.03745, 2021.
