BDAunit-II

Hadoop is an open-source framework for storing and processing large data volumes in a distributed computing environment, ensuring scalability, fault tolerance, and efficient data handling. It utilizes the Hadoop Distributed File System (HDFS) and the MapReduce processing model to manage both structured and unstructured data. The ecosystem includes various tools like Hive, Pig, and Spark, enhancing its capabilities for big data analytics.

UNIT-2

Hadoop
Definition: Hadoop is an open-source framework designed for storing and processing large volumes
of data in a distributed computing environment. It enables scalable and efficient data handling by
utilizing clusters of commodity hardware. The Hadoop framework allows parallel processing and fault
tolerance, making it a powerful tool for managing big data.

Introduction: In today's digital world, the volume of data generated is enormous, requiring systems
that can handle, store, and analyze large datasets efficiently. Traditional database management
systems (DBMS) struggle with scalability and performance when dealing with massive amounts of
data. Hadoop was developed as a solution to these challenges, providing a distributed computing
model that processes large datasets across multiple nodes simultaneously. By leveraging the Hadoop
Distributed File System (HDFS) and the MapReduce processing paradigm, Hadoop ensures high
availability, fault tolerance, and efficient handling of structured and unstructured data.

Features of Hadoop: Hadoop is known for its unique capabilities that make it an essential tool for big
data analytics. The key features include:

1. Scalability: Hadoop can easily scale horizontally by adding more nodes to a cluster, allowing
it to handle increasing amounts of data.

2. Fault Tolerance: It ensures data reliability by replicating data across multiple nodes. If a node
fails, another node takes over its tasks.

3. Distributed Storage: Hadoop uses HDFS to distribute and store data across multiple
machines, improving accessibility and efficiency.

4. Parallel Processing: The framework processes data in parallel using the MapReduce
programming model, boosting speed and performance.

5. Flexibility: It supports structured, semi-structured, and unstructured data, including text,
images, and videos.

6. Cost-Effectiveness: Hadoop operates on inexpensive commodity hardware, reducing
infrastructure costs.

7. High Availability: Even if some nodes fail, the system continues functioning without loss of
data.

8. Security and Authentication: Hadoop includes authentication mechanisms such as Kerberos
for secure data access and processing.
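
The MapReduce model mentioned in feature 4 can be illustrated with a minimal sketch. This is plain Python with hypothetical input splits, not the Hadoop API: each input split is mapped to (word, 1) pairs, and a reduce step sums the counts per word. In a real cluster, the map calls run in parallel on different nodes and a shuffle phase groups pairs by key before reduction.

```python
from collections import defaultdict

# Hypothetical input splits, as if distributed across two nodes.
splits = [
    "big data needs hadoop",
    "hadoop stores big data",
]

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the split.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each word (the shuffle/sort that
    # groups pairs by key is implicit in this single-process sketch).
    totals = defaultdict(int)
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(mapped)
print(counts["hadoop"])  # 2
```

Because each map call depends only on its own split, the work can be spread over any number of nodes, which is what gives MapReduce its horizontal scalability.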

Key Advantages of Hadoop: Hadoop provides numerous benefits, making it one of the most widely
used big data technologies:

1. Efficient Handling of Large Datasets: It can process petabytes of data efficiently by
distributing tasks across clusters.

2. Improved Performance: With parallel data processing, it significantly reduces the time
required for data analysis.

3. Support for Diverse Data Types: Hadoop can handle data from various sources, including
logs, social media, and IoT devices.

4. Integration with Other Technologies: It works seamlessly with Spark, Hive, Pig, and other big
data tools.

5. Automatic Data Replication: Hadoop ensures data redundancy, making it reliable in the
event of hardware failures.

Versions of Hadoop: Hadoop has evolved over time, with multiple versions improving its efficiency
and capabilities.

 Hadoop 1.x: The first version introduced MapReduce for distributed data processing but had
limitations in resource management.

 Hadoop 2.x: Introduced YARN (Yet Another Resource Negotiator) for better resource
management and scalability.

 Hadoop 3.x: Brought enhancements such as erasure coding, improved memory
management, and containerized applications for better performance.

Overview of the Hadoop Ecosystem: The Hadoop ecosystem consists of various tools that enhance
its capabilities. Some of the major components include:

1. HDFS (Hadoop Distributed File System): The storage layer that distributes data across
clusters.

2. MapReduce: A programming model for parallel processing of large datasets.

3. YARN (Yet Another Resource Negotiator): Manages computing resources in Hadoop clusters.

4. Hive: A data warehouse infrastructure that supports SQL-like queries for Hadoop data.

5. Pig: A high-level platform whose scripting language, Pig Latin, is used for processing large datasets.

6. HBase: A NoSQL database that provides real-time read/write access to large data.

7. Sqoop: A tool for transferring data between Hadoop and relational databases.

8. Flume: Used for collecting and aggregating log data.

9. Oozie: A workflow scheduler for managing Hadoop jobs.

10. Spark: A fast and general-purpose engine for large-scale data processing.

Hadoop Distributions: Various companies provide Hadoop distributions with additional features and
enterprise support:

 Apache Hadoop: The official open-source version maintained by the Apache Software
Foundation.

 Cloudera Distribution (CDH): Includes enterprise security, real-time analytics, and a user-
friendly interface.

 Hortonworks Data Platform (HDP): Provides a fully open-source Hadoop ecosystem with
enhanced security.

 MapR Hadoop Distribution: Offers real-time processing, distributed file systems, and NoSQL
database integration.

Need for Hadoop: Hadoop is essential due to the increasing volume, velocity, and variety of data
being generated. Businesses require efficient solutions to process, analyze, and derive insights from
large datasets. Traditional databases fail to scale efficiently, whereas Hadoop provides an
open-source, cost-effective, and scalable solution for managing big data.

RDBMS vs. Hadoop: The following table highlights the key differences between traditional Relational
Database Management Systems (RDBMS) and Hadoop:

Feature          | RDBMS                    | Hadoop
-----------------|--------------------------|-------------------------------------------
Data Structure   | Structured (Tables)      | Structured, Semi-structured, Unstructured
Processing       | Transactional (OLTP)     | Batch Processing (MapReduce)
Schema           | Fixed (Schema-on-Write)  | Schema-on-Read
Scalability      | Vertical (Limited)       | Horizontal (Near-unlimited)
Cost             | Expensive                | Cost-effective (Commodity Hardware)
Fault Tolerance  | Low                      | High (Data Replication)
Querying         | SQL                      | SQL-like and scripting tools (Hive, Pig, Spark)
Speed            | Fast for Small Data      | Optimized for Large Data
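
The schema difference in the table can be shown with a short sketch. This is plain Python with made-up log records, not a Hadoop API: under schema-on-read, raw records are stored exactly as they arrive, and a schema is applied only when the data is read, whereas an RDBMS enforces the schema at write time.

```python
# Raw records stored untyped, as they arrived (schema-on-read).
raw_records = [
    "2024-01-01,click,42",
    "2024-01-01,view,17",
]

def read_with_schema(record):
    # The schema (date, event, count) is applied only at read time;
    # an RDBMS would reject malformed rows at write time instead.
    date, event, count = record.split(",")
    return {"date": date, "event": event, "count": int(count)}

rows = [read_with_schema(r) for r in raw_records]
print(rows[0]["count"] + rows[1]["count"])  # 59
```

The trade-off is flexibility versus safety: schema-on-read lets Hadoop ingest any data source immediately, but malformed records only surface when a job tries to interpret them.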


Distributed Computing Challenges: Distributed computing in Hadoop comes with several challenges,
including:

1. Data Consistency: Ensuring all nodes have synchronized data.

2. Fault Tolerance Management: Handling failures while maintaining system integrity.

3. Network Latency: Communication delays between nodes can impact performance.

4. Load Balancing: Distributing data and computation evenly across nodes.

5. Security Risks: Implementing strong authentication and encryption to prevent data
breaches.

History of Hadoop: Hadoop was inspired by Google's MapReduce and Google File System (GFS)
papers. It was created by Doug Cutting and Mike Cafarella and became an Apache Software
Foundation project in 2006. Since its inception, Hadoop has evolved into a comprehensive ecosystem
that powers modern big data applications.

HDFS (Hadoop Distributed File System): HDFS is the core storage component of Hadoop, designed to
store and manage vast amounts of data efficiently. It follows a Master-Slave Architecture:

1. NameNode (Master): Manages metadata and the file system namespace.

2. DataNodes (Slaves): Store the actual data in blocks and perform read/write operations.

HDFS provides fault tolerance, scalability, and high throughput, making it an integral part of the
Hadoop ecosystem.
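
The master-slave split above can be sketched as a toy model. This is an illustration, not the real HDFS API, and the names (`namenode`, `datanodes`, `write_block`) are invented for the example: the NameNode holds only block locations (metadata), the DataNodes hold block contents, and each block is replicated on several nodes so a single node failure does not lose data.

```python
REPLICATION = 2  # real HDFS defaults to 3; 2 keeps the toy model small

# DataNodes (slaves) store the actual block contents.
datanodes = {"dn1": {}, "dn2": {}, "dn3": {}}

# The NameNode (master) stores only metadata: which nodes hold each block.
namenode = {}

def write_block(block_id, data):
    targets = list(datanodes)[:REPLICATION]
    for dn in targets:
        datanodes[dn][block_id] = data      # replicate the data on each target
    namenode[block_id] = targets            # record locations, never the data

def read_block(block_id):
    # If one replica's node has failed, the next recorded location serves it.
    for dn in namenode[block_id]:
        if block_id in datanodes[dn]:
            return datanodes[dn][block_id]
    raise IOError("all replicas lost")

write_block("blk_1", b"hello hdfs")
del datanodes["dn1"]["blk_1"]               # simulate a DataNode failure
print(read_block("blk_1"))                  # still readable from the replica
```

Keeping only metadata on the master is what lets one NameNode coordinate thousands of DataNodes: it answers "where is block X?" while the bulky reads and writes go directly to the slaves.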
