Big Data Unit 2
Hadoop Ecosystem
Hadoop is an open-source, Java-based framework that manages the storage and
processing of large amounts of data for applications. Hadoop uses distributed storage and
parallel processing to handle big data and analytics jobs, breaking workloads down into
smaller workloads that can be run at the same time. It has various components such as
HDFS, MapReduce, YARN, Hive, Apache Pig, Apache HBase, HCatalog, Avro, Thrift,
Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, Zookeeper, and Apache Oozie.
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
Zookeeper: cluster management
Oozie: job scheduling
Four modules (HDFS, YARN, MapReduce, and Hadoop Common) comprise the primary
Hadoop framework and work collectively with the other components to form the Hadoop
ecosystem:
Hadoop Distributed File System (HDFS): As the primary component of the Hadoop
ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate
on data that resides in their local storage. This removes network latency, providing high-
throughput access to application data. In addition, administrators don’t need to define
schemas up front. HDFS is the primary data storage system used by Hadoop
applications. It is a Java-based system that allows large data sets to be stored across the
nodes of a cluster in a fault-tolerant manner, and it provides scalable, fault-tolerant,
reliable, and cost-efficient data storage for big data. HDFS runs on commodity hardware.
It ships with a default configuration that suits many installations, although large clusters
usually need additional tuning. Users interact with HDFS directly through shell-like
commands.
HDFS is responsible for storing large data sets of structured or unstructured data across
various nodes and for maintaining the metadata in the form of log files.
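Since HDFS is Java-based, applications can also interact with it programmatically through Hadoop's FileSystem API rather than only through shell commands. The following is a minimal sketch, not part of the original notes; the NameNode URI and the paths are illustrative assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; set here only for illustration
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/demo");              // hypothetical directory
        if (!fs.exists(dir)) {
            fs.mkdirs(dir);                             // ask the NameNode to create it
        }

        // Copy a local file into HDFS; its blocks are written to DataNodes
        fs.copyFromLocalFile(new Path("test.txt"), new Path("/user/demo/test.txt"));

        // Read the file back; the client streams blocks directly from DataNodes
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/test.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}

The equivalent shell-level interaction would use commands such as hdfs dfs -mkdir, hdfs dfs -put, and hdfs dfs -cat.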
HDFS consists of two core components:
1. NameNode
2. DataNode
NameNode: The NameNode is also known as the master node. It does not store the actual
data or dataset. Instead, it stores metadata, i.e. the number of blocks, their locations, on
which rack and on which DataNode the data is stored, and other details. The namespace it
manages consists of files and directories.
The NameNode is the prime node and contains metadata (data about data), requiring
comparatively fewer resources than the DataNodes that store the actual data. These
DataNodes are commodity hardware (affordable, easily replaceable computer
hardware that is typically built from off-the-shelf components) in the distributed
environment, which undoubtedly makes Hadoop cost-effective.
Functions/Tasks of DataNode
The DataNode is responsible for serving client read/write requests.
Based on instructions from the NameNode, DataNodes perform block creation,
replication, and deletion.
DataNodes send a heartbeat to the NameNode to report the health of HDFS; if there is no
response to the heartbeat signal, the NameNode assumes that the DataNode has technical
problems and treats it as failed.
DataNodes also send block reports to the NameNode to report the list of blocks they
contain.
FsImage is a file stored on the OS filesystem that contains the complete directory
structure (namespace) of HDFS, with details about the location of the data blocks and
which blocks are stored on which node. This file is used by the NameNode when it is
started.
EditLogs is a transaction log that records the changes in the HDFS file system or any
action performed on the HDFS cluster such as addition of a new block, replication,
deletion etc. In short, it records the changes since the last FsImage was created.
The secondary NameNode is a dedicated node in the HDFS cluster that takes checkpoints
of the file system metadata on the NameNode. It is a helper node that assists the primary
NameNode by periodically merging the EditLogs into the FsImage.
HDFS thus maintains the coordination between the cluster and its hardware, working at
the heart of the system.
YARN (Yet Another Resource Negotiator) handles resource management and job
scheduling for the cluster. Its main components are:
Resource Manager
Node Manager
Application Master
Resource Manager
ResourceManager (RM) is the master that arbitrates all the available cluster resources and
thus helps manage the distributed applications running on the YARN system. It works
together with the per-node NodeManagers (NMs) and the per-application
ApplicationMasters (AMs).
1. NodeManagers take instructions from the ResourceManager and manage resources
available on a single node.
2. ApplicationMasters are responsible for negotiating resources with the
ResourceManager and for working with the NodeManagers to start the containers.
Node Manager
The Node Manager (NM) is YARN’s per-node agent and takes care of the individual
compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource
Manager (RM), overseeing containers’ life-cycle management, monitoring the resource
usage (memory, CPU) of individual containers, tracking node health, managing logs, and
running auxiliary services which may be exploited by different YARN applications.
Application Master
The Application Master is, in effect, an instance of a framework-specific library and is
responsible for negotiating resources from the Resource Manager and working with the
Node Manager(s) to execute and monitor the containers and their resource consumption.
It has the responsibility of negotiating appropriate resource containers from the Resource
Manager, tracking their status and monitoring progress.
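To make this division of labour concrete, the sketch below shows how a client program would talk to the ResourceManager through the YARN client API. It is only an illustrative outline (cluster settings are assumed to come from the usual yarn-site.xml); a real client would additionally describe the ApplicationMaster container before submitting:

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // The client talks to the ResourceManager to register a new application
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // The ResourceManager hands back a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getNewApplicationResponse().getApplicationId();
        System.out.println("Application id from the ResourceManager: " + appId);

        // A real client would now build an ApplicationSubmissionContext describing
        // the ApplicationMaster container and call yarnClient.submitApplication(...)
        yarnClient.stop();
    }
}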
MapReduce: MapReduce is a programming model for large-scale data processing.
MapReduce processes a large volume of data in parallel, by dividing a task into
independent sub-tasks. The MapReduce works by breaking the processing into two tasks:
the map task and the reduce task.
Map takes a set of data and converts it into another set of data, where individual elements
are broken down into key-value pairs. Then the reduce task receives the output from a
map as an input which is in key-value pairs and combines those data into a smaller set of
key-value pairs.
By default, the input type in MapReduce is text. MapReduce programs can be written in
various languages such as Java, Ruby, Python, and C++.
MapReduce frameworks have multiple steps and processes or tasks. MapReduce jobs are
complex and involve multiple steps; some steps are performed by Hadoop with default
behavior and can be overridden if needed.
The MapReduce program is executed in three main phases: mapping phase, shuffling and
sorting phase, and reducing phase. There is also an optional phase known as the combiner
phase.
Mapping phase
This is the first phase of the program. There are two steps in this phase: splitting and
mapping. The input file is divided into smaller, roughly equal chunks called input splits.
Since Mappers understand only (key, value) pairs, Hadoop uses a RecordReader (with
TextInputFormat by default) to transform input splits into key-value pairs. An input split
is a logical partition of the data used during data processing in the MapReduce program
or other processing techniques; each logical partition is processed by one Mapper.
In MapReduce, parallelism is achieved by the Mappers. For each input split, a new
instance of the Mapper is instantiated. The mapping step contains the coding logic that is
applied to these data blocks: the Mapper processes the key-value pairs and produces an
output of the same form (key-value pairs).
Shuffling and sorting phase
Sorting and shuffling are responsible for producing, for each unique key, a list of its
values. Bringing occurrences of the same key to one location is known as sorting, and the
process by which the intermediate output of the Mapper is sorted and sent across to the
Reducers is known as shuffling.
Reducer phase
The output of the shuffle and sorting phase is used as the input to the Reducer phase, and
the Reducer processes the list of values for each key. Each key may be sent to a different
Reducer. The Reducer computes an output value for each key, and the consolidated
output of the MapReduce job is saved in HDFS as the final output.
Suppose the text file we are using is called test.txt and that it contains a few lines of
words (one of its lines is "Java Python Hadoop").
The output which we expect should look like this:
Apache – 1
Data – 2
Hadoop – 3
Java – 1
Python – 2
Suppose a user runs a query (count the number of occurrences of each unique word) on
our test.txt file. To keep track of the request we use the JobTracker (a master service),
which traps the request and tracks its progress.
First, the input is divided into smaller chunks called input splits; in our case it is divided
into 3 splits. Since there are three input splits, three different Mappers will be used, one
for each split.
Each Mapper then parses its lines, extracts each word, and emits a <word, 1> pair. In this
example, the output of the Mapper for the line Java Python Hadoop will be <Java, 1>,
<Python, 1>, and <Hadoop, 1>.
All Mappers emit the word as the key and the hardcoded value 1 as the value. The reason
for using the value 1 and not any other value is that each emitted pair represents a single
occurrence of that word.
In the next phase (shuffle and sort), the key-value pairs output by the Mappers that share
the same key are consolidated. So the keys ‘Hadoop’, ‘Data’, ‘Java’, and the others are
consolidated, and their values are appended into a list; in this case <Java, List<1>>,
<Python, List<1, 1>>, <Hadoop, List<1, 1, 1>>, and so on.
The keys produced by the Mappers are compared and sorted, and each key with its list of
values is sent to the next step in the sorted order of the keys.
Next, the Reducer phase gets <key, List<values>> as input and simply counts the number
of 1s in each list, setting that count as the output value; for example, <Java, List<1>>
becomes <Java, 1> and <Hadoop, List<1, 1, 1>> becomes <Hadoop, 3>.
In the end, all the output key/value pairs from the Reducer phase will be consolidated to a
file and will be saved in HDFS as the final output.
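The word-count walkthrough above corresponds to the standard WordCount program distributed with Hadoop. A minimal Java version is sketched below; the input and output paths are passed as command-line arguments, and the reducer is also registered as the optional combiner:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in a line, emit <word, 1>
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: receive <word, list of 1s> after shuffle/sort and emit <word, count>
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional combiner phase
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input/test.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}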
MapReduce is used in many applications; let us have a look at some of them.
Entertainment:
A lot of web series and movies are released on various OTT platforms such as Netflix.
You might have come across a situation where you can’t decide which movie to watch
and you take a look at the suggestions provided by Netflix and then you try one of the
suggested series or movies.
Hadoop and MapReduce are used by Netflix to recommend popular movies to users
based on what they have watched and liked.
MapReduce can determine how users are watching movies by analyzing their logs and
clicks.
E-commerce:
Many e-commerce companies such as Flipkart, Amazon, and eBay use MapReduce to
analyze the buying behavior of their customers based on the customers’ interests and
purchase history. It analyzes records, purchase history, user interaction logs, etc., and
powers the product recommendation mechanisms of these companies.
Advantages of MapReduce
Scalability: MapReduce can handle petabytes of data by splitting it into smaller chunks
and assigning them to multiple nodes in a cluster.
Fault-tolerance: if a node fails, the framework reschedules its map or reduce tasks on
other nodes, so the job still completes.
Parallel processing-compatible: MapReduce is suitable for computations involving large
quantities of data that require parallel processing.
Versatile: businesses can use MapReduce programming to access new data sources.
Fast: because processing runs on the nodes where the data already resides, data
movement is minimized and jobs finish quickly.
Based on a simple programming model: developers only write the map and reduce
functions; the framework handles distribution, scheduling, and recovery.
Hadoop Common: Hadoop Common includes the libraries and utilities used and shared
by other Hadoop modules.
Hadoop consists of the Hadoop Common package, which provides file system and
operating system level abstractions, a MapReduce engine (either MapReduce/MR1 or
YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common
package contains the Java Archive (JAR) files and scripts needed to start Hadoop.
Hadoop archive (HAR) is a facility that packs small files into larger archive files to avoid
wasting NameNode memory. The NameNode stores the metadata of the HDFS data, so if
a 1 GB file is broken into 1000 small pieces, the NameNode has to store metadata about
all 1000 of those small files; in that way NameNode memory is wasted in storing and
managing a lot of metadata.
A HAR is created from a collection of files, and the archiving tool runs a MapReduce job
whose map tasks process the input files in parallel to create the archive file.
HAR syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this
archive as MapReduce input, all you need to do is specify the input directory as
har:///user/zoo/foo.har.
The archive exposes index and part files, for example:
/data/myArch.har/_index
/data/myArch.har/_masterindex
/data/myArch.har/part-0
The part files contain the contents of the original files concatenated into large files, and
the index files are used to look up the small files inside the large part files.
Limitations of HAR files:
1) Creating an archive makes an extra copy of the original files, so you have to delete the
originals manually after creation of the archive to release some disk space.
2) Once an archive is created, to add or remove files from/to the archive we need to
re-create the archive.
3) Reading a HAR as MapReduce input requires lots of map tasks (one per original small
file), which is inefficient.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem
continues to grow and includes many tools and applications to help collect, store, process,
analyze, and manage big data. These include Apache Pig, Apache Hive, Apache
HBase,Apache Spark, Presto, and Apache Zeppelin.
Hadoop is often used by companies that need to handle and store big data.
Zookeeper
Apache Zookeeper is an open-source project that provides a centralized service for
distributed systems. It provides configuration information, naming, synchronization, and
group services over large clusters. In HDFS high availability, if the active NameNode
falters, the Zookeeper daemon detects the failure and carries out the failover process to a
new NameNode; using Zookeeper to automate failovers minimizes the impact a
NameNode failure can have on the cluster.
Zookeeper is a centralized repository where distributed applications can put data and get
data out of it. It is used to keep the distributed system functioning together as a single
unit.
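As an illustration of this "centralized repository" idea, the sketch below uses the ZooKeeper Java client to store a small piece of configuration data in a znode and read it back. The ensemble address and znode paths are assumptions for the example:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ZooKeeper ensemble (host and port are assumptions)
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Put a small piece of configuration data into a znode
        // (a second run would need to handle the node already existing)
        if (zk.exists("/demo", false) == null) {
            zk.create("/demo", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/demo/config", "value=1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read it back, as any other process in the distributed system could
        byte[] data = zk.getData("/demo/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}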
Oozie: Oozie performs the task of a scheduler, scheduling jobs and binding them together
as a single logical unit of work. There are two kinds of Oozie jobs: Oozie workflow jobs
and Oozie coordinator jobs. Workflow jobs are jobs that need to be executed in a
sequentially ordered manner, whereas coordinator jobs are triggered when some data
becomes available or an external stimulus is given.
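Workflows are normally described in XML and submitted to the Oozie server. The sketch below, based on the standard Oozie Java client API, shows how a client might submit and check a workflow job; the server URL, application path, and job properties are assumptions:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Points at the Oozie server (URL is an assumption)
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // Job properties: where the workflow definition lives and its parameters
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/my-wf");
        conf.setProperty("inputDir", "/user/demo/input");
        conf.setProperty("outputDir", "/user/demo/output");

        // Submit and start the workflow job, then check its status
        String jobId = client.run(conf);
        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Workflow " + jobId + " is " + job.getStatus());
    }
}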
HBase:
What is HBase?
An HBase system is designed to scale linearly. It comprises a set of standard tables with
rows and columns, much like a traditional database. Each table must have an element
defined as a primary key, and all access attempts to HBase tables must use this primary
key.
Avro: Avro, as a component, supports a rich set of primitive data types including numeric
types, binary data, and strings, and a number of complex types including arrays, maps,
enumerations, and records. A sort order can also be defined for the data.
An example of HBase
HBase allows for many attributes to be grouped together into column families, such that
the elements of a column family are all stored together. This is different from a
row-oriented relational database, where all the columns of a given row are stored together.
With HBase you must predefine the table schema and specify the column families.
However, new columns can be added to families at any time, making the schema flexible
and able to adapt to changing application requirements.
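To make the row key / column family model concrete, here is a minimal sketch using the HBase Java client API. The "users" table with a "profile" column family, the row key, and the connection settings are assumptions; such a table would normally be created beforehand, for example from the HBase shell:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Connection settings (Zookeeper quorum etc.) come from hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write: every cell is addressed by row key + column family + column qualifier
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read by row key (the table's "primary key")
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}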
Just as HDFS has a NameNode and slave nodes, and MapReduce has a JobTracker and
TaskTracker slaves, HBase is built on similar concepts: in HBase a master node manages
the cluster, and region servers store portions of the tables and perform the work on the
data. And just as HDFS has some enterprise concerns because of the availability of the
NameNode, HBase is also sensitive to the loss of its master node.
Limitations of Hadoop
Hadoop can perform only batch processing, and data will be accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data
in a single unit of time (random access).
Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of
the databases that store huge amounts of data and access the data in a random manner.
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.
HBase is a data model similar to Google’s Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
Comparison of HDFS and HBase
HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.
HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.
HDFS: It provides high-latency batch processing.
HBase: It provides low-latency access to single rows from billions of records (random
access).
HDFS: It provides only sequential access to data.
HBase: HBase internally uses hash tables and provides random access, and it stores the
data in indexed HDFS files for faster lookups.
Column-oriented databases are those that store data tables as sections of columns of data
rather than as rows of data. In short, their tables are organized into column families.
“OLAP is optimized for complex data analysis and reporting, while OLTP is optimized
for transactional processing and real-time updates.”
“Performance for an OLAP system differs and can range from minutes to hours,
depending on the query complexity and volume of data. For an OLTP, processing occurs
in real-time and can be as fast as milliseconds”
Reading data for analytical queries is much more efficient in a columnar (OLAP-style)
database than in a row-oriented (OLTP-style) one.
Features of HBase
HBase is linearly scalable.
It has automatic failure support.
It provides consistent reads and writes.
It integrates with Hadoop, both as a source and a destination.
It has an easy Java API for clients.
It provides data replication across clusters.
Where to Use HBase
Apache HBase is used to have random, real-time read/write access to Big Data.
It hosts very large tables on top of clusters of commodity hardware.
Apache HBase is a non-relational database modeled after Google's Bigtable.
Just as Bigtable works on top of the Google File System, Apache HBase works on top of
Hadoop and HDFS.
Applications of HBase
It is used whenever there is a need for write-heavy applications.
HBase is used whenever we need to provide fast random access to available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Hive
What is Hive in Hadoop?
The Apache Hive™ data warehouse software facilitates reading, writing, and
managing large datasets residing in distributed storage using SQL. Hive is an
open-source system that processes structured data in Hadoop. Apache Hive is built
on top of Apache Hadoop for providing data query and analysis. Hive gives an
SQL-like interface to query data stored in various databases and file systems that
integrate with Hadoop. Hive itself is implemented in Java.
Architecture of Hive
Hive chiefly consists of three core parts:
Hive Clients: Hive offers a variety of drivers designed for communication with
different applications. For example, Hive provides Thrift clients for Thrift-based
applications. These clients and drivers then communicate with the Hive server, which
falls under Hive services.
Hive Services: Hive services perform client interactions with Hive. For example, if a
client wants to perform a query, it must talk with Hive services.
Hive Storage and Computing: Hive services such as the file system, job client, and
metastore then communicate with Hive storage and store things like metadata table
information and query results.
Hive's Features
Hive is designed for querying and managing only structured data stored in tables
Hive is scalable, fast, and uses familiar concepts
Schema gets stored in a database, while processed data goes into a Hadoop Distributed
File System (HDFS)
Tables and databases get created first; then data gets loaded into the proper tables
Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record Columnar
File), and TEXTFILE
Hive uses an SQL-inspired language, sparing the user from dealing with the
complexity of MapReduce programming. It makes learning more accessible by
utilizing familiar concepts found in relational databases, such as columns, tables, rows,
and schema, etc.
The most significant difference between the Hive Query Language (HQL) and SQL is
that Hive executes queries on Hadoop's infrastructure instead of on a traditional
database
Since Hadoop's programming works on flat files, Hive uses directory structures to
"partition" data, improving performance on specific queries
Hive supports partitions and buckets for fast and simple data retrieval
Hive supports custom user-defined functions (UDFs) for tasks like data cleansing and
filtering. Hive UDFs can be defined according to programmers’ requirements (a minimal
example is sketched below)
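As an illustration of the last point, the sketch below uses Hive's classic UDF interface to implement a simple data-cleansing function that trims and lower-cases a string value. The class name is an assumption:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A simple data-cleansing UDF that trims and lower-cases a string column
public class CleanString extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().trim().toLowerCase());
    }
}

After packaging the class into a JAR, it would typically be registered from the Hive shell with ADD JAR and CREATE TEMPORARY FUNCTION, after which it can be called from HQL like any built-in function.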
Limitations of Hive
Of course, no resource is perfect, and Hive has some limitations. They are:
Hive doesn’t support OLTP. Hive supports Online Analytical Processing (OLAP), but
not Online Transaction Processing (OLTP).
Its support for subqueries is limited.
It has a high latency.
By default, Hive tables don’t support delete or update operations.
1. The data analyst executes a query with the User Interface (UI).
2. The driver interacts with the query compiler to retrieve the plan, which consists of the
query execution process and metadata information. The driver also parses the query to
check syntax and requirements.
3. The compiler creates the job plan (metadata) to be executed and communicates with
the metastore to retrieve a metadata request.
4. The metastore sends the metadata information back to the compiler
5. The compiler relays the proposed query execution plan to the driver.
6. The driver sends the execution plans to the execution engine.
7. The execution engine (EE) processes the query by acting as a bridge between the Hive
and Hadoop. The job process executes in MapReduce. The execution engine sends the
job to the JobTracker, found in the Name node, and assigns it to the TaskTracker, in
the Data node. While this is happening, the execution engine executes metadata
operations with the metastore.
8. The results are retrieved from the data nodes.
9. The results are sent to the execution engine, which, in turn, sends the results back to
the driver and the front end (UI).
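From a client program's point of view, this whole flow is hidden behind an ordinary database connection. The sketch below issues an HQL query through the HiveServer2 JDBC driver; the host, port, credentials, and the "words" table are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are assumptions
        String url = "jdbc:hive2://hive-server:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = con.createStatement()) {
            // The HQL below goes through the driver -> compiler -> execution engine path
            ResultSet rs = stmt.executeQuery(
                    "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word");
            while (rs.next()) {
                System.out.println(rs.getString("word") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}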
Since we have gone on at length about what Hive is, we should also touch on what Hive
is not: it is not a relational database, it is not designed for online transaction processing,
and it is not meant for real-time queries or row-level updates.
Hive Modes
Depending on the size of Hadoop data nodes, Hive can operate in two different modes:
Local mode
MapReduce mode
Use Local mode when:
Hadoop is installed under pseudo mode, possessing only one data node
The data size is smaller and limited to a single local machine
Users expect faster processing because the local machine contains smaller datasets.
Use MapReduce mode when:
Hadoop has multiple data nodes, and the data is distributed across these different
nodes
Users must deal with more massive data sets
MapReduce is Hive's default mode.
Hive: Hive is a platform used to create SQL-like scripts for MapReduce functions.
Pig
Apache Pig is a high-level procedural language platform for creating programs that
run on Apache Hadoop. The language used for Pig is Pig Latin. Since Pig Latin is
quite similar to SQL, it is comparatively easy to learn Apache Pig if we have a little
knowledge of SQL. With Pig Latin, programmers are able to work on MapReduce
tasks without writing complicated code, as they would in Java.
1. In-built operators: Apache Pig provides a very good set of operators for performing
several data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to
write a Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically optimized. This
makes the programmers concentrate only on the semantics of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured
data and store the results in HDFS.
Apache Pig Architecture
The main reason why programmers have started using Hadoop Pig is that it converts the
scripts into a series of MapReduce tasks making their job easy. Below is the architecture
of Pig Hadoop:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along with
other miscellaneous checks. Parser gives an output in the form of a Directed Acyclic
Graph (DAG) that contains Pig Latin statements, together with other logical
operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is
passed to a logical optimizer. The optimizer is responsible for carrying out the logical
optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is
received. The compiler compiles the logical plan sent by the optimizer; the logical
plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs
are sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop
for yielding the desired result.
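Pig Latin scripts are usually run from the Grunt shell or as script files, but they can also be embedded in Java through the PigServer API, which drives the same parser, optimizer, compiler, and execution engine pipeline described above. The sketch below is illustrative only; the input file and its schema are assumptions, and local mode is used instead of MapReduce mode for simplicity:

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registered Pig Latin statement goes through the parser, optimizer, and compiler
        pig.registerQuery("users = LOAD 'users.txt' USING PigStorage(',') "
                + "AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");

        // Only when results are requested does the execution engine launch the job(s)
        Iterator<Tuple> it = pig.openIterator("adults");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}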
Language used: Hive uses a declarative language, a variant of SQL called HQL, whereas
Pig uses a unique procedural language called Pig Latin.
Flume
Apache Flume is a tool for collecting, aggregating, and transporting large amounts of
streaming data. It's commonly used in big data environments to ingest log files, social
media data, clickstreams, and other high-volume data sources.
Flume is a special-purpose tool for sending data into HDFS. It's a critical component for
building end-to-end streaming workloads, with typical use cases including:
Fraud detection
Internet of Things applications
Aggregation of sensor and machine data
Alerting/SIEM
Here's how Flume works:
1. The Flume source picks up log files from data-generating sources like web servers and
Twitter and sends them to the channel.
2. Flume's sink component ensures that the data it receives is synced to the destination.
3. Flume moves these files to the Hadoop Distributed File System (HDFS) for further
processing.
4. Flume can also write to other storage systems such as HBase or Solr.
Flume is robust and fault tolerant with tunable reliability mechanisms and many failover
and recovery mechanisms.
Sqoop:
Sqoop is defined as the tool used to perform data transfer operations between relational
database management systems and the Hadoop server. Thus it helps in the transfer of
bulk data from one source to another.
Some of the important Features of the Sqoop :
Sqoop helps us load the results of SQL queries into the Hadoop Distributed File
System.
Sqoop helps us load the processed data directly into Hive or HBase.
It secures the data transfer operation with the help of Kerberos.
With the help of Sqoop, we can perform compression of the processed data.
Sqoop is highly powerful and efficient in nature.
There are two major operations performed in Sqoop :
1. Import
2. Export
Sqoop Import:
The Sqoop import tool imports individual tables from an RDBMS to HDFS. Each row in
a table is treated as a record in HDFS. All records are stored as text data in text files or as
binary data in Avro and Sequence files.
Syntax
Sqoop Export:
The Sqoop export tool is used to export data back from HDFS to an RDBMS database.
The target table must exist in the target database. The files that are given as input to
Sqoop contain records, which are called rows in the table. Those are read and parsed into
a set of records and delimited with a user-specified delimiter.
The default operation is to insert all the records from the input files into the database
table using the INSERT statement. In update mode, Sqoop generates an UPDATE
statement that replaces the existing records in the database.
Syntax