
BZ GROW MORE INSTITUTE OF MSC(CA&IT) SEM-8

Unit-2 INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE


Big Data – Apache Hadoop & Hadoop Ecosystem

Hadoop is an open-source framework provided by Apache for processing and analyzing very large volumes of data. It is written in Java and is used by companies such as Google, Facebook, LinkedIn, Yahoo, and Twitter.

This unit covers the main Big Data Hadoop topics, including HDFS, MapReduce, YARN, Hive, HBase, Pig, and Sqoop.

Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop consists of four main modules:

1. Hadoop Common: A set of common utilities and libraries that support the other Hadoop modules.

2. Hadoop Distributed File System (HDFS): A distributed file system that stores data across multiple machines. It provides high-throughput access to application data and is designed to be fault-tolerant.

3. Hadoop YARN (Yet Another Resource Negotiator): A resource management layer responsible for managing resources and scheduling applications on the Hadoop cluster.

4. Hadoop MapReduce: A programming model and processing engine for large-scale data processing. It allows users to write applications that process large amounts of data in parallel across a distributed cluster.

Hadoop is widely used in industries such as finance, healthcare, advertising, and social media for tasks like log processing, data warehousing, machine learning, and more. It is known for its scalability, fault tolerance, and ability to handle diverse types of data.

Hadoop Ecosystem
The Hadoop Ecosystem is a group of software tools and frameworks built on the core components of Apache Hadoop. It enables storing, processing, and analyzing large amounts of data, and provides the infrastructure needed to process large datasets by distributing data and processing tasks across clusters of computers.

Hadoop Ecosystem Components
The Hadoop Ecosystem is composed of several components that work together to enable the storage and analysis of data. The components that collectively form the Hadoop Ecosystem are summarized in the table below.

| Components | Description |
| --- | --- |
| HDFS | Hadoop Distributed File System |
| YARN | Yet Another Resource Negotiator |
| MapReduce | Programming-based data processing |
| Spark | In-memory data processing |
| PIG, HIVE | Query-based data processing services |
| HBase | NoSQL database |
| Mahout, Spark MLlib | Machine learning algorithm libraries |
| Zookeeper | Cluster management |
| Oozie | Job scheduling |
Now we will learn about each of the components in detail.
Hadoop Distributed File System
• HDFS is the primary storage system in the Hadoop Ecosystem.

• It is a distributed file system that provides reliable and scalable storage of large datasets across multiple computers.

• HDFS divides data into blocks and distributes them across the cluster for fault tolerance and high availability.

• It consists of 2 basic components:

  o NameNode
  o DataNode

• The NameNode is the primary node. It stores metadata and therefore requires comparatively fewer resources than the DataNodes, which store the actual data.

• It maintains all the coordination between the cluster and the hardware.


HDFS Architecture

The main purpose of HDFS is to ensure that data is preserved even in the event of failures such as NameNode failures, DataNode failures, and network partitions.
HDFS uses a master/slave architecture, where one device (the master) controls one or more other devices (the slaves).
Important points about HDFS architecture:
1. Files are split into fixed-size blocks and replicated across multiple DataNodes.

2. The NameNode contains the file system metadata and coordinates data access.

3. Clients interact with HDFS through APIs to read, write, and delete files.

4. DataNodes send heartbeats to the NameNode to report status and block information.

5. HDFS is rack-aware and places replicas on different racks for fault tolerance. Checksums are used to ensure the integrity of stored data.
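
To make point 3 concrete, here is a minimal sketch (not part of the original notes) of a client reading and writing a file through the standard org.apache.hadoop.fs.FileSystem Java API; the NameNode URI and file path are assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the NameNode records the metadata,
        // the DataNodes store the actual blocks.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same API.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.delete(path, false); // clean up
    }
}
```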
Yarn
• YARN (Yet Another Resource Negotiator) helps manage resources across the cluster.

• It has 3 main components:

  o Resource Manager
  o Node Manager
  o Application Master

• The Resource Manager allocates resources to the applications running in the system.

• The Node Manager manages resources such as CPU, memory, and bandwidth on each machine and reports their usage back to the Resource Manager.

• The Application Master negotiates resource requirements with the Resource Manager and works with the Node Managers to run the application.

Yarn Architecture

Key points about YARN architecture:

• The Resource Manager has the authority to allocate resources to applications in the system.

• Node Managers manage resources such as CPU, memory, and bandwidth on each machine and report back to the Resource Manager.

• The Application Master acts as an interface between the Resource Manager and the Node Managers, negotiating the application's resource needs.
MapReduce
• MapReduce is a programming model and processing framework that enables parallel processing of large data sets.

• MapReduce works with big data by splitting a job into smaller tasks, called map and reduce tasks, which can run simultaneously.

• Map tasks process data and produce intermediate results.

• The intermediate results are then combined by reduce tasks to produce the final output.

• MapReduce makes use of two functions, Map() and Reduce():

  o Map() sorts and filters data, organizing it into groups.
  o A map produces results as key-value pairs, which are later processed by the Reduce() method.
  o Reduce() performs summarization by aggregating related data.

• Simply put, Reduce() takes the output produced by Map() as input and combines those tuples into a smaller set of tuples.
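
As an illustration of the Map() and Reduce() functions, the following is a minimal word-count sketch using the Hadoop Java MapReduce API; the class names are illustrative, not taken from the unit:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in the input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // intermediate key-value pair
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and sums the counts.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final key-value pair
    }
}
```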
Apache Hive
• Hive provides a data warehousing infrastructure on top of Hadoop.

• It provides a SQL-like query language called HiveQL, which allows users to query, analyze, and manage large datasets stored in Hadoop.

• Hive translates queries into MapReduce (or other execution engine) jobs, enabling data summarization, ad-hoc queries, and data analysis.
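
As a hedged illustration, HiveQL can be submitted from Java over JDBC using the HiveServer2 driver; the host, port, database, and the sales table below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL assumes a local HiveServer2 on port 10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con =
                 DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Hypothetical 'sales' table; Hive turns this query into execution-engine jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " -> " + rs.getString("total"));
            }
        }
    }
}
```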
Apache Pig
• Pig is a high-level scripting language and platform for simplifying data processing tasks in Hadoop.

• It provides a language called Pig Latin, which allows users to express data transformations and analytical operations.

• Pig optimizes these operations and transforms them into MapReduce jobs for execution.
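
A minimal sketch of driving a Pig Latin script from Java via the PigServer API; the input file, schema, and output directory are assumptions:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file and schema.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (url:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Pig compiles the script into MapReduce jobs and writes the result.
        pig.store("totals", "url_totals_out");
    }
}
```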
HBase
• HBase is a distributed, column-oriented NoSQL database that runs on Hadoop.

• It provides real-time random read/write access to large datasets.

• HBase is suitable for applications that require low-latency access to data, for example real-time analytics, time-series data, and Online Transaction Processing (OLTP) systems.
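
A minimal sketch of HBase's random read/write access using the standard Java client API; the table name, column family, and row key are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table 'metrics' with column family 'cf'.
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Random write: one row keyed by sensor id and timestamp.
            Put put = new Put(Bytes.toBytes("sensor42-2024-01-01T00:00"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            // Random low-latency read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("sensor42-2024-01-01T00:00")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("temp"))));
        }
    }
}
```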
Apache Spark
• Spark is a fast and versatile cluster computing system that extends the capabilities of Hadoop.

• It offers in-memory processing, enabling faster data processing and iterative analysis.

• Spark supports batch processing, real-time stream processing, and interactive data analysis, making it a versatile tool in the Hadoop Ecosystem.
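
A minimal sketch of Spark's in-memory processing from Java (a word count over an assumed HDFS path); on a real cluster the master would be set by spark-submit rather than in code:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode for illustration only.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical HDFS input path.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);   // aggregation happens in memory

            counts.saveAsTextFile("hdfs:///user/demo/wordcount_out");
        }
    }
}
```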
Apache Kafka
• Kafka is a distributed streaming platform that enables the ingestion and processing of real-time data streams.

• It provides a publish/subscribe model for streaming data, allowing applications to process data as it is generated.

• Kafka is commonly used to build real-time data pipelines, event-driven architectures, and streaming analytics applications.
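
A minimal sketch of publishing an event with the Kafka Java producer API; the broker address and the clickstream topic are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaPublishExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish one event to a hypothetical 'clickstream' topic;
        // any subscriber of that topic can consume and process it.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream", "user42", "page=/home"));
        }
    }
}
```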
Apache Sqoop
• Sqoop is a Hadoop tool that makes it easy to move data between Hadoop and structured databases.

• This tool helps connect traditional databases with the Hadoop Ecosystem.
Apache Flume
• Flume makes it easier to collect large volumes of streaming data and send it to Hadoop.

• It helps ingest data from different sources such as log files, social media, and sensors.
Hadoop Performance Optimization
Here are some methods for optimizing the Hadoop Ecosystem for Big Data.
Take Advantage of HDFS Block Size Optimization
• Configure the HDFS block size based on the typical size of your data files.

• Larger block sizes improve performance for reading and writing large files, while smaller block sizes are beneficial for smaller files.
Optimize Data Replication Factor


• Adjust the replication factor based on the required fault tolerance and cluster storage capacity.

• A lower replication factor reduces storage overhead and improves performance, but at the cost of lower fault tolerance.
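
A minimal sketch (values and path are assumptions) of setting the block size and replication factor for an individual file through the HDFS Java API; cluster-wide defaults would normally be configured via the dfs.blocksize and dfs.replication properties in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTuningExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path path = new Path("/user/demo/large_dataset.bin");
        long blockSize = 256L * 1024 * 1024;  // 256 MB blocks suit large files
        short replication = 2;                // fewer replicas: less storage, less fault tolerance
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(path, true, bufferSize, replication, blockSize)) {
            out.writeBytes("example payload");
        }

        // Replication of an existing file can also be adjusted later.
        fs.setReplication(path, (short) 3);
    }
}
```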
Optimize Your Network Settings
• Configure network settings such as network buffers and TCP settings to maximize data transfer speeds between the nodes in your cluster.

• Hadoop performance improves when network bandwidth increases and latency decreases.
Increase Concurrency
• Split large computing tasks into smaller, parallelizable tasks to make optimal use of your cluster's compute resources.

• This can be achieved by adjusting the number of mappers and reducers in your MapReduce job, as sketched below.
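
A minimal sketch of the knobs commonly used to influence task parallelism; the job name and the chosen values are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConcurrencyTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuned-job");

        // Reducers are set explicitly on the job.
        job.setNumReduceTasks(8);

        // The number of mappers is driven by the input splits; lowering the
        // maximum split size yields more, smaller map tasks.
        job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024); // 64 MB splits
    }
}
```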
Optimize Task Scheduling
• Configure the Hadoop scheduler, e.g. the Fair Scheduler or the Capacity Scheduler, for efficient resource allocation.

• Fine-tuning the scheduling parameters ensures fair resource allocation and maximizes cluster utilization.

Moving Data in and out of Hadoop – Understanding inputs and outputs of MapReduce

MapReduce is a programming model and associated implementation for processing and generating large datasets that can be parallelized across a distributed cluster of computers. It consists of two main phases: the Map phase and the Reduce phase. The inputs and outputs of each phase are described below.

### Map Phase:

**Input:**


- **Key-Value pairs:** The input data is divided into splits, and each record within a split is represented as a key-value pair. Typically, the key is used to identify the data record (for example, the byte offset of a line), and the value contains the actual data.

**Processing:**
- **Mapper function:** A user-defined function called the "mapper" is
applied to each key-value pair independently. The mapper function takes
the input key-value pair and emits intermediate key-value pairs based on
the processing logic. It can filter, transform, or extract information from the
input data.

**Output:**
- **Intermediate Key-Value pairs:** The mapper function generates
intermediate key-value pairs as its output. These key-value pairs are
usually different from the input key-value pairs and are emitted based on
the logic defined in the mapper function. The intermediate key-value pairs
are grouped by key and shuffled across the cluster to prepare for the next
phase.

### Reduce Phase:

**Input:**
- **Grouped Key-Value pairs:** The intermediate key-value pairs generated
by the map phase are shuffled and grouped based on their keys. All
intermediate values associated with the same key are collected together
and passed to the reducer function.

**Processing:**
- **Reducer function:** A user-defined function called the "reducer" is
applied to each group of intermediate values sharing the same key. The
reducer function aggregates, summarizes, or processes these values to
produce the final output.


**Output:**
- **Final Output Key-Value pairs:** The reducer function generates the final
output key-value pairs based on the processing logic. These key-value
pairs constitute the result of the MapReduce job and typically represent
the desired computation or analysis performed on the input data.

In summary, the inputs to MapReduce are the initial dataset represented as key-value pairs, and the outputs are the final processed results, also represented as key-value pairs, with intermediate processing stages in between.
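
A minimal driver sketch showing how the Map and Reduce phases and their output key-value types are wired together when a job is submitted; the input and output paths are assumptions, and TokenizerMapper/IntSumReducer refer to the word-count sketch earlier in this unit:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);   // Map phase: (offset, line) -> (word, 1)
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);    // Reduce phase: (word, [counts]) -> (word, sum)

        // Final output key-value types (also used for the intermediate output here).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS input and output paths.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```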
Data Serialization

What is Serialization?
Serialization is the process of converting a data object—a
combination of code and data represented within a region of data
storage—into a series of bytes that saves the state of the object in
an easily transmittable form. In this serialized form, the data can be
delivered to another data store (such as an in-memory computing
platform), application, or some other destination.

Data serialization is the process of converting an object into a stream of bytes to more easily save or transmit it.
The reverse process—constructing a data structure or object from a
series of bytes—is deserialization. The deserialization process
recreates the object, thus making the data easier to read and modify
as a native structure in a programming language.


Serialization and deserialization work together to transform and recreate data objects to and from a portable format.
Serialization enables us to save the state of an object and recreate the object in a new location. Serialization encompasses both the storage of the object and the exchange of data. Since objects are composed of several components, saving or delivering all the parts typically requires significant coding effort, so serialization is a standard way to capture the object in a sharable format.
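
A minimal sketch of serialization and deserialization using Java's built-in object serialization; the Customer class and its fields are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationExample {
    // Hypothetical data object; implementing Serializable opts it in.
    static class Customer implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int orders;
        Customer(String name, int orders) { this.name = name; this.orders = orders; }
    }

    public static void main(String[] args) throws Exception {
        // Serialize: object state -> stream of bytes.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Customer("Asha", 12));
        }

        // The byte array could now be sent over the wire or written to a data store.

        // Deserialize: bytes -> a recreated object with the same state.
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Customer copy = (Customer) in.readObject();
            System.out.println(copy.name + " has " + copy.orders + " orders");
        }
    }
}
```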

With serialization, we can transfer objects:

• Over the wire for messaging use cases
• From application to application via web services such as REST APIs
• Through firewalls (as JSON or XML strings)
• Across domains
• To other data stores
• To identify changes in data over time
• While honoring security and user-specific details across applications

Why Is Data Serialization Important for Distributed Systems?

In some distributed systems, data and its replicas are stored in different partitions on multiple cluster members. If data is not present on the local member, the system will retrieve that data from another member. This requires serialization for use cases such as:

• Adding key/value objects to a map
• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic

What Are Common Languages for Data Serialization?

A number of popular object-oriented programming languages provide either native support for serialization or have libraries that add non-native serialization capabilities to their feature set. Java, .NET, C++, Node.js, Python, and Go, for example, all either have native serialization support or integrate with serializer libraries.

Data formats such as JSON and XML are often used as the format for storing serialized data. Custom binary formats are also used; these tend to be more space-efficient due to less markup/tagging in the serialization.
What Is Data Serialization in Big Data?
Big data systems often include technologies/data that are described as "schemaless." This means that the managed data in these systems is not structured in a strict format, as defined by a schema. Serialization provides several benefits in this type of environment:

• Structure. By inserting some schema or criteria for a data structure through serialization on read, we can avoid reading data that misses mandatory fields, is incorrectly classified, or lacks some other quality control requirement.
• Portability. Big data comes from a variety of systems and may be written in a variety of languages. Serialization can provide the necessary uniformity to transfer such data to other enterprise systems or applications.
• Versioning. Big data is constantly changing. Serialization allows us to apply version numbers to objects for lifecycle management.
