
BZ GROW MORE INSTITUTE OF MSC(CA&IT) SEM-8

Unit-2 INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE


Big Data – Apache Hadoop & Hadoop Ecosystem

Hadoop is an open-source framework provided by Apache for processing and analyzing very large volumes of data. It is written in Java and is used by companies such as Google, Facebook, LinkedIn, Yahoo, and Twitter.

This unit covers the main Big Data Hadoop topics, including HDFS, MapReduce, YARN, Hive, HBase, Pig, and Sqoop.

Apache Hadoop is an open-source software framework used for distributed storage and processing of large datasets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop consists of four main modules:

1. Hadoop Common: A set of common utilities and libraries that support the other Hadoop modules.

2. Hadoop Distributed File System (HDFS): A distributed file system that stores data across multiple machines. It provides high-throughput access to application data and is designed to be fault-tolerant.

3. Hadoop YARN (Yet Another Resource Negotiator): A resource management layer responsible for managing resources and scheduling applications on the Hadoop cluster.

4. Hadoop MapReduce: A programming model and processing engine for large-scale data processing. It allows users to write applications that process large amounts of data in parallel across a distributed cluster.

Hadoop is widely used in industries such as finance, healthcare, advertising, and social media for tasks like log processing, data warehousing, machine learning, and more. It is known for its scalability, fault tolerance, and ability to handle diverse types of data.

Hadoop Ecosystem
The Hadoop Ecosystem is a group of software tools and frameworks built on the core components of Apache Hadoop. It enables storing, processing, and analyzing large amounts of data, and provides the infrastructure needed to process large datasets by distributing data and processing tasks across clusters of computers.

Hadoop Ecosystem Components
The Hadoop Ecosystem is composed of several components that work together to enable the storage and analysis of data. The components that collectively form the Hadoop Ecosystem are summarized in the table below.

| Components | Description |
| --- | --- |
| HDFS | Hadoop Distributed File System |
| YARN | Yet Another Resource Negotiator |
| MapReduce | Programming-based data processing |
| Spark | In-memory data processing |
| PIG, HIVE | Query-based data processing services |
| HBase | NoSQL database |
| Mahout, Spark MLlib | Machine learning algorithm libraries |
| Zookeeper | Cluster management |
| Oozie | Job scheduling |
Now we will learn about each of the components in detail.
Hadoop Distributed File System
• HDFS is the primary storage system in the Hadoop Ecosystem.

• It is a distributed file system that provides reliable and scalable storage of large datasets across multiple computers.

• HDFS divides data into blocks and distributes them across the cluster for fault tolerance and high availability.

• It consists of 2 basic components:

  o NameNode
  o DataNode

• The NameNode is the primary node. It stores metadata and therefore requires comparatively fewer resources than the DataNodes, which store the actual data.

• It maintains all the coordination between the cluster and the hardware.


HDFS Architecture

The main purpose of HDFS is to ensure that data is preserved even in the event of failures such as NameNode failures, DataNode failures, and network partitions.
HDFS uses a master/slave architecture, where one device (the master) controls one or more other devices (the slaves).
Important points about HDFS architecture:
1. Files are split into fixed-size blocks and replicated across multiple DataNodes.

2. The NameNode contains the file system metadata and coordinates data access.

3. Clients interact with HDFS through APIs to read, write, and delete files.

4. DataNodes send heartbeats to the NameNode to report status and block information.

5. HDFS is rack-aware and places replicas on different racks for fault tolerance. Checksums are used to ensure the integrity of stored data.
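
To make point 3 concrete, here is a minimal sketch (not part of the original notes) of a client reading and writing a file through the standard org.apache.hadoop.fs.FileSystem Java API; the NameNode URI and file path are assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file: the NameNode records the metadata,
        // the DataNodes store the actual blocks.
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back through the same API.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.delete(path, false); // clean up
    }
}
```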
Yarn
• YARN (Yet Another Resource Negotiator) helps manage resources across the cluster.

• It has 3 main components:

  o Resource Manager
  o Node Manager
  o Application Master

• The Resource Manager allocates resources to the applications running in the system.

• The Node Manager manages resources such as CPU, memory, and bandwidth on each machine and reports their usage back to the Resource Manager.

• The Application Master negotiates resource requirements with the Resource Manager and works with the Node Managers to run the application.

Yarn Architecture

Key points about YARN architecture:

• The Resource Manager has the authority to allocate resources to applications in the system.

• Node Managers manage resources such as CPU, memory, and bandwidth on each machine and report back to the Resource Manager.

• The Application Master acts as an interface between the Resource Manager and the Node Managers, negotiating the application's resource needs.
MapReduce
• MapReduce is a programming model and processing framework that enables parallel processing of large data sets.

• MapReduce works with big data by splitting a job into smaller tasks, called map and reduce tasks, which can run simultaneously.

• Map tasks process data and produce intermediate results.

• The intermediate results are then combined by reduce tasks to produce the final output.

• MapReduce makes use of two functions, Map() and Reduce():

  o Map() sorts and filters data, organizing it into groups.
  o A map produces results as key-value pairs, which are later processed by the Reduce() method.
  o Reduce() performs summarization by aggregating related data.

• Simply put, Reduce() takes the output produced by Map() as input and combines those tuples into a smaller set of tuples.
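
As an illustration of the Map() and Reduce() functions, the following is a minimal word-count sketch using the Hadoop Java MapReduce API; the class names are illustrative, not taken from the unit:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in the input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // intermediate key-value pair
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and sums the counts.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final key-value pair
    }
}
```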
Apache Hive
• Hive provides a data warehousing infrastructure on top of Hadoop.

• It provides a SQL-like query language called HiveQL, which allows users to query, analyze, and manage large datasets stored in Hadoop.

• Hive translates queries into MapReduce (or other execution engine) jobs, enabling data summarization, ad-hoc queries, and data analysis.
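
As a hedged illustration, HiveQL can be submitted from Java over JDBC using the HiveServer2 driver; the host, port, database, and the sales table below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL assumes a local HiveServer2 on port 10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con =
                 DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // Hypothetical 'sales' table; Hive turns this query into execution-engine jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString("region") + " -> " + rs.getString("total"));
            }
        }
    }
}
```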
Apache Pig
• Pig is a high-level scripting language and platform for simplifying data processing tasks in Hadoop.

• It provides a language called Pig Latin, which allows users to express data transformations and analytical operations.

• Pig optimizes these operations and transforms them into MapReduce jobs for execution.
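
A minimal sketch of driving a Pig Latin script from Java via the PigServer API; the input file, schema, and output directory are assumptions:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical input file and schema.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (url:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");

        // Pig compiles the script into MapReduce jobs and writes the result.
        pig.store("totals", "url_totals_out");
    }
}
```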
HBase
• HBase is a distributed, column-oriented NoSQL database that runs on Hadoop.

• It provides real-time random read/write access to large datasets.

• HBase is suitable for applications that require low-latency access to data, for example real-time analytics, time-series data, and Online Transaction Processing (OLTP) systems.
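
A minimal sketch of HBase's random read/write access using the standard Java client API; the table name, column family, and row key are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table 'metrics' with column family 'cf'.
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            // Random write: one row keyed by sensor id and timestamp.
            Put put = new Put(Bytes.toBytes("sensor42-2024-01-01T00:00"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);

            // Random low-latency read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("sensor42-2024-01-01T00:00")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("temp"))));
        }
    }
}
```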
Apache Spark
• Spark is a fast and versatile cluster computing system that extends the capabilities of Hadoop.

• It offers in-memory processing, enabling faster data processing and iterative analysis.

• Spark supports batch processing, real-time stream processing, and interactive data analysis, making it a versatile tool in the Hadoop Ecosystem.
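
A minimal sketch of Spark's in-memory processing from Java (a word count over an assumed HDFS path); on a real cluster the master would be set by spark-submit rather than in code:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // Local mode for illustration only.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Hypothetical HDFS input path.
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);   // aggregation happens in memory

            counts.saveAsTextFile("hdfs:///user/demo/wordcount_out");
        }
    }
}
```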
Apache Kafka
• Kafka is a distributed streaming platform that enables the ingestion and processing of real-time data streams.

• It provides a publish/subscribe model for streaming data, allowing applications to process data as it is generated.

• Kafka is commonly used to build real-time data pipelines, event-driven architectures, and streaming analytics applications.
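
A minimal sketch of publishing an event with the Kafka Java producer API; the broker address and the clickstream topic are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaPublishExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish one event to a hypothetical 'clickstream' topic;
        // any subscriber of that topic can consume and process it.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clickstream", "user42", "page=/home"));
        }
    }
}
```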
Apache Sqoop
• Sqoop is a Hadoop tool that makes it easy to move data between Hadoop and structured databases.

• This tool helps connect traditional databases with the Hadoop Ecosystem.
Apache Flume
• Flume makes it easier to collect large volumes of streaming data and send it to Hadoop.

• It helps ingest data from different sources such as log files, social media, and sensors.
Hadoop Performance Optimization
Here are some methods for optimizing the Hadoop Ecosystem for Big Data.
Take Advantage of HDFS Block Size Optimization
• Configure the HDFS block size based on the typical size of your data files.

• Larger block sizes improve performance for reading and writing large files, while smaller block sizes are beneficial for smaller files.
Optimize Data Replication Factor


• Adjust the replication factor based on the required fault tolerance and cluster storage capacity.

• A lower replication factor reduces storage overhead and improves performance, but at the cost of lower fault tolerance.
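
A minimal sketch (values and path are assumptions) of setting the block size and replication factor for an individual file through the HDFS Java API; cluster-wide defaults would normally be configured via the dfs.blocksize and dfs.replication properties in hdfs-site.xml:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsTuningExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path path = new Path("/user/demo/large_dataset.bin");
        long blockSize = 256L * 1024 * 1024;  // 256 MB blocks suit large files
        short replication = 2;                // fewer replicas: less storage, less fault tolerance
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(path, true, bufferSize, replication, blockSize)) {
            out.writeBytes("example payload");
        }

        // Replication of an existing file can also be adjusted later.
        fs.setReplication(path, (short) 3);
    }
}
```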
Optimize Your Network Settings
• Configure network settings such as network buffers and TCP settings to maximize data transfer speeds between the nodes in your cluster.

• Hadoop performance improves when network bandwidth increases and latency decreases.
Increase Concurrency
• Split large computing tasks into smaller, parallelizable tasks to make optimal use of your cluster's compute resources.

• This can be achieved by adjusting the number of mappers and reducers in your MapReduce job, as sketched below.
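
A minimal sketch of the knobs commonly used to influence task parallelism; the job name and the chosen values are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConcurrencyTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tuned-job");

        // Reducers are set explicitly on the job.
        job.setNumReduceTasks(8);

        // The number of mappers is driven by the input splits; lowering the
        // maximum split size yields more, smaller map tasks.
        job.getConfiguration().setLong(
            "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024); // 64 MB splits
    }
}
```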
Optimize Task Scheduling
• Configure the Hadoop scheduler, e.g. the Fair Scheduler or the Capacity Scheduler, for efficient resource allocation.

• Fine-tuning the scheduling parameters ensures fair resource allocation and maximizes cluster utilization.

Moving Data in and out of Hadoop – Understanding inputs and outputs of MapReduce

MapReduce is a programming model and associated implementation for processing and generating large datasets that can be parallelized across a distributed cluster of computers. It consists of two main phases: the Map phase and the Reduce phase. The inputs and outputs of each phase are described below.

### Map Phase:

**Input:**


- **Key-Value pairs:** The input data is divided into splits, and each record within a split is represented as a key-value pair. Typically, the key is used to identify the data record (for example, the byte offset of a line), and the value contains the actual data.

**Processing:**
- **Mapper function:** A user-defined function called the "mapper" is
applied to each key-value pair independently. The mapper function takes
the input key-value pair and emits intermediate key-value pairs based on
the processing logic. It can filter, transform, or extract information from the
input data.

**Output:**
- **Intermediate Key-Value pairs:** The mapper function generates
intermediate key-value pairs as its output. These key-value pairs are
usually different from the input key-value pairs and are emitted based on
the logic defined in the mapper function. The intermediate key-value pairs
are grouped by key and shuffled across the cluster to prepare for the next
phase.

### Reduce Phase:

**Input:**
- **Grouped Key-Value pairs:** The intermediate key-value pairs generated
by the map phase are shuffled and grouped based on their keys. All
intermediate values associated with the same key are collected together
and passed to the reducer function.

**Processing:**
- **Reducer function:** A user-defined function called the "reducer" is
applied to each group of intermediate values sharing the same key. The
reducer function aggregates, summarizes, or processes these values to
produce the final output.


**Output:**
- **Final Output Key-Value pairs:** The reducer function generates the final
output key-value pairs based on the processing logic. These key-value
pairs constitute the result of the MapReduce job and typically represent
the desired computation or analysis performed on the input data.

In summary, the inputs to MapReduce are the initial dataset represented as key-value pairs, and the outputs are the final processed results, also represented as key-value pairs, with intermediate processing stages in between.
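
A minimal driver sketch showing how the Map and Reduce phases and their output key-value types are wired together when a job is submitted; the input and output paths are assumptions, and TokenizerMapper/IntSumReducer refer to the word-count sketch earlier in this unit:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenizerMapper.class);   // Map phase: (offset, line) -> (word, 1)
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);    // Reduce phase: (word, [counts]) -> (word, sum)

        // Final output key-value types (also used for the intermediate output here).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS input and output paths.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```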
Data Serialization

What is Serialization?
Serialization is the process of converting a data object—a
combination of code and data represented within a region of data
storage—into a series of bytes that saves the state of the object in
an easily transmittable form. In this serialized form, the data can be
delivered to another data store (such as an in-memory computing
platform), application, or some other destination.

Data serialization is the process of converting an object into a stream of bytes to more easily save or transmit it.
The reverse process—constructing a data structure or object from a
series of bytes—is deserialization. The deserialization process
recreates the object, thus making the data easier to read and modify
as a native structure in a programming language.


Serialization and deserialization work together to transform and recreate data objects to and from a portable format.
Serialization enables us to save the state of an object and recreate the object in a new location. Serialization encompasses both the storage of the object and the exchange of data. Since objects are composed of several components, saving or delivering all the parts typically requires significant coding effort, so serialization is a standard way to capture the object in a sharable format.
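
A minimal sketch of serialization and deserialization using Java's built-in object serialization; the Customer class and its fields are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationExample {
    // Hypothetical data object; implementing Serializable opts it in.
    static class Customer implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
        int orders;
        Customer(String name, int orders) { this.name = name; this.orders = orders; }
    }

    public static void main(String[] args) throws Exception {
        // Serialize: object state -> stream of bytes.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Customer("Asha", 12));
        }

        // The byte array could now be sent over the wire or written to a data store.

        // Deserialize: bytes -> a recreated object with the same state.
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            Customer copy = (Customer) in.readObject();
            System.out.println(copy.name + " has " + copy.orders + " orders");
        }
    }
}
```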

With serialization, we can transfer objects:

• Over the wire for messaging use cases
• From application to application via web services such as REST APIs
• Through firewalls (as JSON or XML strings)
• Across domains
• To other data stores
• To identify changes in data over time
• While honoring security and user-specific details across applications

Why Is Data Serialization Important for Distributed Systems?

In some distributed systems, data and its replicas are stored in different partitions on multiple cluster members. If data is not present on the local member, the system will retrieve that data from another member. This requires serialization for use cases such as:

• Adding key/value objects to a map
• Putting items into a queue, set, or list
• Sending a lambda function to another server
• Processing an entry within a map
• Locking an object
• Sending a message to a topic

What Are Common Languages for Data Serialization?

A number of popular object-oriented programming languages provide either native support for serialization or have libraries that add non-native serialization capabilities to their feature set. Java, .NET, C++, Node.js, Python, and Go, for example, all either have native serialization support or integrate with serializer libraries.

Data formats such as JSON and XML are often used as the format for storing serialized data. Custom binary formats are also used; these tend to be more space-efficient due to less markup/tagging in the serialization.
What Is Data Serialization in Big Data?
Big data systems often include technologies/data that are described as "schemaless." This means that the managed data in these systems is not structured in a strict format, as defined by a schema. Serialization provides several benefits in this type of environment:

• Structure. By inserting some schema or criteria for a data structure through serialization on read, we can avoid reading data that misses mandatory fields, is incorrectly classified, or lacks some other quality control requirement.
• Portability. Big data comes from a variety of systems and may be written in a variety of languages. Serialization can provide the necessary uniformity to transfer such data to other enterprise systems or applications.
• Versioning. Big data is constantly changing. Serialization allows us to apply version numbers to objects for lifecycle management.
