UNIT2 BDA
Our Hadoop tutorial covers all the main topics of Big Data Hadoop: HDFS,
MapReduce, YARN, Hive, HBase, Pig, Sqoop, and more.
Hadoop Ecosystem
The Hadoop Ecosystem is a group of software tools and frameworks built
around the core components of Apache Hadoop. It enables storing,
processing, and analyzing large volumes of data.
ASS.PRO.UPEKSHA CHAUDHRI 1
BZ GROW MORE INSTITUTE OF MSC(CA&IT) SEM-8
In the above diagram, we can see the components that collectively form a
Hadoop Ecosystem.
Now we will learn about each of the components in detail.
Hadoop Distributed File System
• HDFS is the primary storage system in the Hadoop Ecosystem.
• HDFS divides data into blocks and distributes them across the
cluster for fault tolerance and high availability.
HDFS Architecture
The main purpose of HDFS is to ensure that data is preserved even in the
event of failures such as NameNode failures, DataNode failures, and
network partitions.
HDFS uses a master/slave architecture, where one device (master) controls
one or more other devices (slaves).
Important points about HDFS architecture:
1. Files are split into fixed-size chunks and replicated across
multiple DataNodes.
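The splitting and replication described above can be sketched in plain Python. This is only an illustration of the idea, not real HDFS code; the block size, node names, and round-robin placement below are simplified assumptions (real HDFS uses 128 MB blocks by default and rack-aware placement).

```python
# Illustrative sketch (not real HDFS code): split a byte payload into
# fixed-size blocks and assign each block to several DataNodes.
from itertools import cycle

BLOCK_SIZE = 4     # tiny for demonstration; HDFS defaults to 128 MB
REPLICATION = 3    # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return the fixed-size chunks a file would be divided into."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    start = cycle(range(len(datanodes)))
    for idx, _ in enumerate(blocks):
        s = next(start)
        placement[idx] = [datanodes[(s + r) % len(datanodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop!")
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(len(blocks))                       # 4
print(place_replicas(blocks, nodes)[0])  # ['dn1', 'dn2', 'dn3']
```

Because each block lives on several nodes, losing any single DataNode never loses data, which is the fault-tolerance property the text describes.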
Yarn Architecture
YARN (Yet Another Resource Negotiator) manages resources and schedules
jobs across the cluster. Its main components are:
o Resource Manager
o Node Manager
o Application Master
MapReduce
• MapReduce processes big data by splitting a job into smaller
tasks, called map and reduce tasks, which can run
simultaneously across the cluster.
• This helps ingest data from different sources such as log files,
social media, and sensors.
Hadoop Performance Optimization
Here are some methods for optimizing the Hadoop Ecosystem for big data
workloads.
Take Advantage of HDFS Block Size Optimization
• Configure the HDFS block size based on the typical size of your
data files.
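For example, the block size can be set cluster-wide in hdfs-site.xml through the `dfs.blocksize` property. The 256 MB value below is only an illustration; the right value depends on your typical file sizes.

```xml
<!-- hdfs-site.xml: raise the default block size to 256 MB -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 * 1024 * 1024 bytes -->
</property>
```

Larger blocks mean fewer map tasks and less NameNode metadata for big files; smaller blocks suit many small files.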
Map Phase
**Input:**
- **Key-Value pairs:** The input data is divided into chunks, and each
chunk is represented as a key-value pair. Typically, the key is used to
identify the data record, and the value contains the actual data.
**Processing:**
- **Mapper function:** A user-defined function called the "mapper" is
applied to each key-value pair independently. The mapper function takes
the input key-value pair and emits intermediate key-value pairs based on
the processing logic. It can filter, transform, or extract information from the
input data.
**Output:**
- **Intermediate Key-Value pairs:** The mapper function generates
intermediate key-value pairs as its output. These key-value pairs are
usually different from the input key-value pairs and are emitted based on
the logic defined in the mapper function. The intermediate key-value pairs
are grouped by key and shuffled across the cluster to prepare for the next
phase.
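The map phase described above can be sketched in plain Python. This is a hypothetical word-count mapper and shuffle, purely illustrative: Hadoop itself runs these steps as Java tasks, and the function names here are assumptions, not Hadoop's API.

```python
# Hypothetical word-count mapper and shuffle, sketched in plain Python.
from collections import defaultdict

def mapper(key, value):
    """Emit an intermediate (word, 1) pair for every word in the record."""
    for word in value.split():
        yield word.lower(), 1

def shuffle(intermediate_pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for k, v in intermediate_pairs:
        groups[k].append(v)
    return dict(groups)

records = {0: "Big Data", 1: "big data tools"}
pairs = [kv for key, value in records.items() for kv in mapper(key, value)]
print(shuffle(pairs))   # {'big': [1, 1], 'data': [1, 1], 'tools': [1]}
```

Note how the input key (the record offset) is discarded and new keys (the words) are emitted, exactly as the text says: intermediate pairs are usually different from the input pairs.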
Reduce Phase
**Input:**
- **Grouped Key-Value pairs:** The intermediate key-value pairs generated
by the map phase are shuffled and grouped based on their keys. All
intermediate values associated with the same key are collected together
and passed to the reducer function.
**Processing:**
- **Reducer function:** A user-defined function called the "reducer" is
applied to each group of intermediate values sharing the same key. The
reducer function aggregates, summarizes, or processes these values to
produce the final output.
**Output:**
- **Final Output Key-Value pairs:** The reducer function generates the final
output key-value pairs based on the processing logic. These key-value
pairs constitute the result of the MapReduce job and typically represent
the desired computation or analysis performed on the input data.
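Continuing the word-count sketch, a minimal reducer over already-grouped pairs might look like this. Again this is illustrative Python under assumed names, not Hadoop's actual Java API.

```python
# Hypothetical reducer for word count: sum the grouped values per key.
def reducer(key, values):
    yield key, sum(values)

# Grouped intermediate pairs as they would arrive after the shuffle phase.
grouped = {"big": [1, 1], "data": [1, 1], "tools": [1]}

final_output = dict(
    pair for key, values in grouped.items() for pair in reducer(key, values)
)
print(final_output)   # {'big': 2, 'data': 2, 'tools': 1}
```

Each output pair is the final result for one key, which is why reducers for different keys can run independently on different nodes.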
What is Serialization?
Serialization is the process of converting a data object (a
combination of code and data held in a region of data storage) into a
series of bytes that saves the state of the object in an easily
transmittable form. In this serialized form, the data can be
delivered to another data store (such as an in-memory computing
platform), an application, or some other destination.
Data formats such as JSON and XML are often used for storing
serialized data. Custom binary formats are also used; they tend to be
more space-efficient because they carry less markup/tagging in the
serialization.
What Is Data Serialization in Big Data?
Big data systems often include technologies and data that are described
as "schemaless." This means that the data managed in these systems
is not structured in a strict format defined by a schema.
Serialization provides several benefits in this type of environment: