
Addis Ababa Science and Technology University (AASTU)

Group Assignment for the course Emerging Technology

Course code: EMTE1008

Title: Hadoop

Section: 25

Group 5
NAMES                     ID
1. DAGIM ABEBE            ETS0342/15
2. KIDUS WENDESEN         ETS0782/15
3. BIRUK ENDALKACHEW      ETS1526/15
4. YOHANNES GEZAHEGN      ETS1437/15
5. KIDUS YOSEF TESFAYE    ETS0784/15
6. HERMELA ASHAGRE S      ETS0679/15
7. DEBORA YOHANNES        ETS0385/15
8. LIYA ROBEL             ETS0839/15

Instructor: Mr. Abera
Submission date: 4/9/2023
Introduction
This paper is intended to educate readers by exploring Hadoop, an open-source
framework designed to handle and process large volumes of data. It begins by
defining Hadoop and its significance in the realm of big data analytics, then delves
into the various components of Hadoop, including the Hadoop Distributed File
System (HDFS), and offers a historical overview tracing Hadoop's evolution and
impact on the field of data processing.
By examining these key aspects in depth, and with the aid of illustrative examples,
our project aims to provide a comprehensive understanding of Hadoop's role in
managing and analyzing big data.
We reached the completion of this assignment with the help of various information
sources, all of which are listed in the references. Our method was mainly one of
collaboration, with each member responsibly carrying out their assigned tasks.

What is Hadoop?
Hadoop is an open-source software framework for storing large amounts of data and
performing computation on it in a distributed computing environment. The framework
is written mainly in Java, with some native code in C and shell scripts. It is
designed to handle big data and is based on the MapReduce programming model, which
allows for the parallel processing of large data sets across clusters of machines.
Components of Hadoop
1. HDFS: The Hadoop Distributed File System is a dedicated file system for storing
   big data on a cluster of commodity (cheaper) hardware with a streaming access
   pattern. It stores each block of data on multiple nodes in the cluster, which
   provides fault tolerance and keeps the data available even when individual
   machines fail.
2. MapReduce: Data stored in HDFS also needs to be processed. When a query is
   submitted against a data set in HDFS, Hadoop splits the work into parts that run
   on the nodes where the data resides; each part applies a map function to its
   portion of the data, and the intermediate results are then combined in a reduce
   step to form the overall result that is sent back to the user. Thus, while HDFS
   is used to store the data, MapReduce is used to process it (a word-count sketch
   follows this list).
3. YARN: YARN stands for Yet Another Resource Negotiator. It acts as a dedicated
   operating system for Hadoop, managing the resources of the cluster and serving as
   a framework for job scheduling. The available scheduling policies include First
   Come First Serve (FIFO), the Fair Scheduler, and the Capacity Scheduler, with the
   Capacity Scheduler being the default in recent Apache Hadoop releases.
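To make the map and reduce steps concrete, here is a minimal word-count sketch
written against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is
an illustrative example added for this paper rather than part of the framework
description itself: the mapper emits a (word, 1) pair for every word it reads from
its block of input, and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: runs on the nodes holding the HDFS blocks and emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // intermediate (word, 1) pair
        }
    }
}

// Reduce step: receives every count emitted for one word and sums them.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum)); // final (word, total) pair
    }
}

Many such map tasks run in parallel on the nodes that already hold the data, and the
reduce phase then combines their intermediate output; this is exactly the
store-then-process pattern described above.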

 Hadoop also includes several additional modules that provide extra functionality,
  such as Hive (a data warehouse system with a SQL-like query language), Pig (a
  high-level platform for creating MapReduce programs), and HBase (a non-relational,
  distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing, business
  intelligence, and machine learning. It is also used for data processing, data
  analysis, and data mining. It enables the distributed processing of large data
  sets across clusters of computers using a simple programming model.

History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders are
Doug Cutting and Mike Cafarella. Co-founder Doug Cutting named it after his son's
toy elephant. In October 2003, Google published its paper on the Google File System.
In January 2006, MapReduce development started within the Apache Nutch project, with
around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In
April 2006, Hadoop 0.1.0 was released.
Hadoop is thus an open-source software framework for storing and processing big
data, created under the Apache Software Foundation in 2006 and based on the papers
Google published describing the Google File System (GFS) and the MapReduce
programming model. The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering
local computation and storage. It is used by many organizations, including Yahoo,
Facebook, and IBM, for purposes such as data warehousing, log processing, and
research, and it has been widely adopted in industry as a key technology for big
data processing.
Features of Hadoop:
1. It is fault-tolerant.
2. It is highly available.
3. Its programming model is simple.
4. It offers huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data
processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines,
  allowing for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines,
  making it easy to add more capacity as needed.
 Fault Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
  continue to operate even in the presence of hardware failures.
 Data Locality: Hadoop schedules computation on the nodes where the data is
  stored, which reduces network traffic and improves performance.
 High Availability: Hadoop provides high-availability features that help ensure
  the data is always accessible and is not lost.
 Flexible Data Processing: Hadoop's MapReduce programming model allows data to be
  processed in a distributed fashion, making it easy to implement a wide variety of
  data processing tasks.
 Data Integrity: Hadoop provides built-in checksums, which help ensure that the
  stored data is consistent and correct.
 Data Replication: Hadoop replicates data across the cluster for fault tolerance.
 Data Compression: Hadoop provides built-in data compression, which reduces
  storage space and improves performance (see the sketch after this list).
 YARN: A resource management platform that allows multiple data processing
  engines, such as real-time streaming, batch processing, and interactive SQL, to
  run and process data stored in HDFS.
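As a small illustration of the compression and replication features (a sketch added
here, not taken from the original text), the helper methods below show how output
compression could be enabled on a MapReduce job such as the word-count example
above, and how the replication factor of a single HDFS file could be changed; the
gzip codec and the helper method names are only example assumptions.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FeatureExamples {

    // Data Compression: write the job's output files compressed with gzip to save HDFS space.
    static void enableOutputCompression(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }

    // Data Replication: ask HDFS to keep a different number of copies of one particular file.
    static void setReplication(FileSystem fs, Path file, short replicas) throws IOException {
        fs.setReplication(file, replicas);
    }
}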

Hadoop Distributed File System


Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks
and distributes them across the nodes of a large cluster. In the case of a node
failure, the system keeps operating, and the data transfer between nodes needed to
recover is handled by HDFS.
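As an added illustration (not part of the original text), the short Java sketch
below uses the HDFS client API (org.apache.hadoop.fs.FileSystem) to ask where the
blocks of a file are stored and how many replicas each block has; the path
/user/demo/input.txt is a hypothetical example file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file assumed to exist in HDFS already.
        Path file = new Path("/user/demo/input.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());

        // Each block lives on several DataNodes; this is what gives HDFS its
        // fault tolerance and what MapReduce exploits for data locality.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " is stored on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}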

HDFS
Advantages of HDFS: It is inexpensive, its data is immutable, it stores data
reliably, it can tolerate faults, it is scalable, it is block-structured, and it can
process a large amount of data simultaneously, among other strengths. Disadvantages
of HDFS: Its biggest disadvantage is that it is not suitable for small quantities of
data. It also has potential stability issues and can be restrictive and rough to
work with. Hadoop also supports a wide range of software packages such as Apache
Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache
Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Some common frameworks of Hadoop
1. Hive - It uses HiveQL for data structuring and for writing complicated MapReduce
   jobs over data in HDFS.
2. Drill - It consists of user-defined functions and is used for data exploration.
3. Storm - It allows real-time processing and streaming of data.
4. Spark - It contains a machine learning library (MLlib) for enhanced machine
   learning and is widely used for data processing. It also supports Java, Python,
   and Scala.
5. Pig - It has Pig Latin, a SQL-like language, and performs data transformation on
   unstructured data.
6. Tez - It reduces the complexities of Hive and Pig and helps their code run
   faster.
The Hadoop framework is made up of the following modules:
1. Hadoop MapReduce - a MapReduce programming model for handling and processing
   large data.
2. Hadoop Distributed File System (HDFS) - a distributed file system that stores
   files in blocks across the nodes of a cluster.
3. Hadoop YARN - a platform which manages computing resources.
4. Hadoop Common - the packages and libraries used by the other modules.
Advantages and Disadvantages of Hadoop
Advantages:
 Ability to store a large amount of data.
 High flexibility.
 Cost effective.
 High computational power.
 Tasks are independent.
 Linear scaling.

Hadoop has several advantages that make it a popular choice for big data
processing:

 Economically Feasible: It is cheaper to store and process data than it was in the
  traditional approach, since the machines used to store the data are only commodity
  hardware.
 Easy to Use: The projects and tools provided by Apache Hadoop are easy to work
  with when analyzing complex data sets.
 Open Source: Hadoop is distributed as open-source software under the Apache
  License, so one does not need to pay for it; just download it and use it.
 Fault Tolerance: Hadoop stores three copies of the data by default, so even if
  one copy is lost because of a commodity hardware failure, the data is safe.
  Moreover, since Hadoop version 3 supports multiple NameNodes, the single point of
  failure has also been removed.
 Scalability: Hadoop is highly scalable. To scale the cluster up or down, one only
  needs to change the number of commodity machines in the cluster.
 Distributed Processing: HDFS and MapReduce ensure distributed storage and
  processing of the data.
 Locality of Data: This is one of the most alluring and promising features of
  Hadoop. To process a query over a data set, instead of bringing the data to the
  local computer, Hadoop sends the query to the servers holding the data and fetches
  only the final result from there. This is called data locality.
A Hadoop cluster is a collection of commodity hardware (devices that are inexpensive
and amply available). These hardware components work together as a single unit. A
Hadoop cluster contains many nodes (computers and servers) organized into masters
and slaves: the Name Node and the Resource Manager act as masters, while the Data
Nodes and Node Managers act as slaves. The purpose of the master nodes is to
coordinate the slave nodes within a single Hadoop cluster. We design Hadoop clusters
for storing, analyzing, understanding, and finding the facts that are hidden behind
data sets containing crucial information. A Hadoop cluster stores and processes
different types of data:
 Structured data: data which is well structured, as in MySQL.
 Semi-structured data: data which has structure but not a fixed data type, like
  XML and JSON (JavaScript Object Notation).
 Unstructured data: data that does not have any structure, like audio and video.
Hadoop Cluster Schema: (figure)

Hadoop Cluster Properties: (figure)

Disadvantages:
 Not very effective for small data.
 Hard cluster management.
 Has stability issues.
 Security concerns.
 Complexity: Hadoop can be complex to set up and maintain, especially for
organizations without a dedicated team of experts.
 Latency: Hadoop is not well-suited for low-latency workloads and may not be the
best choice for real-time data processing.
 Limited Support for Real-time Processing: Hadoop’s batch-oriented nature
makes it less suited for real-time streaming or interactive data processing use
cases.
 Limited Support for Structured Data: Hadoop is designed to work with unstructured
  and semi-structured data; it is not well-suited for structured data processing.
 Data Security: Hadoop's built-in security features, such as data encryption and
  user authentication, are limited out of the box, which can make it difficult to
  secure sensitive data.
 Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model
is not well-suited for ad-hoc queries, making it difficult to perform exploratory
data analysis.

 Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS
  and MapReduce, are not well-suited for graph and machine-learning workloads;
  specialized components such as Apache Giraph and Mahout are available but have
  some limitations.
 Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
 Data Loss: In the event of a hardware failure, the data stored in a single node
may be lost permanently.
 Data Governance: Data governance is a critical aspect of data management, and
  Hadoop does not provide built-in features for managing data lineage, data quality,
  data cataloging, and data auditing.

Evolution of Hadoop
Hadoop was designed by Doug Cutting and Michael Cafarella in 2005. The design of
Hadoop is inspired by Google: Hadoop stores huge amounts of data through a system
called the Hadoop Distributed File System (HDFS) and processes this data with the
MapReduce technology, and the designs of HDFS and MapReduce are inspired by the
Google File System (GFS) and Google's MapReduce. Around the year 2000, Google
overtook all existing search engines and became the most popular and profitable
search engine, a success attributed to its unique Google File System and MapReduce.
No one outside Google knew about these systems at the time. In 2003, Google released
a paper on GFS, but it was not enough to understand the overall working of Google,
so in 2004 Google released the remaining papers. The two enthusiasts Doug Cutting
and Michael Cafarella studied those papers and in 2005 designed what is now called
Hadoop. Doug's son had a toy elephant named Hadoop, so Doug and Michael gave their
new creation the name "Hadoop", and hence its symbol is a toy elephant. This is how
Hadoop evolved: the designs of HDFS and MapReduce were created by Doug Cutting and
Michael Cafarella but were originally inspired by Google.
Traditional Approach: Suppose we want to process some data. In the traditional
approach, we used to store data on local machines and then process it. As data kept
growing, local machines could no longer store such huge data sets, so data started
to be stored on remote servers. To process that data under the traditional approach,
it has to be fetched from the servers and then processed. Suppose this data is
500 GB; in practice it is very complex and expensive to fetch it. This approach is
also called the Enterprise Approach.
In the new Hadoop approach, instead of fetching the data onto local machines, we
send the query to the data. The query to process the data is obviously not as large
as the data itself. Moreover, on the cluster the query is divided into several
parts, and all these parts process the data simultaneously. This is called parallel
execution and is possible because of MapReduce. So, not only is there no need to
fetch the data, but the processing also takes less time. The result of the query is
then sent back to the user. Thus Hadoop makes data storage, processing, and analysis
far easier than the traditional approach; a sketch of submitting such a job follows.
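To illustrate "sending the query to the data", here is a hedged sketch of a driver
class that submits the word-count job from the earlier example to the cluster; the
class name and command-line layout are illustrative assumptions rather than anything
prescribed by the assignment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input already lives in HDFS; only this small job description is
        // shipped to the cluster, not the data to the client.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, it could be launched with something like "hadoop jar
wordcount.jar WordCountDriver /user/demo/input /user/demo/output" (paths
hypothetical); YARN then schedules the map tasks close to the HDFS blocks they read.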

How do the components of Hadoop make it a solution for Big Data?

1. Hadoop Distributed File System: On a local PC, the default block size on the hard
   disk is about 4 KB. When Hadoop is installed, HDFS uses a much larger default
   block size, 64 MB in early versions and 128 MB in Hadoop 2 and later, because it
   is meant to store huge data sets. HDFS works with a Name Node and Data Nodes: the
   Name Node is a master service that keeps the metadata recording which commodity
   machine each block of data resides on, while the Data Nodes store the actual
   data. Because the blocks are so large, far less metadata has to be stored, which
   makes HDFS efficient. Also, Hadoop stores three copies of every data set at
   different locations, which ensures that Hadoop is not prone to a single point of
   failure. (The configuration keys behind these settings are sketched after this
   list.)
2. MapReduce: In the simplest terms, MapReduce breaks a query into multiple parts,
   and each part processes its share of the data concurrently. This parallel
   execution helps a query execute faster and makes Hadoop a suitable and optimal
   choice for dealing with Big Data.
3. YARN: Yet Another Resource Negotiator works like an operating system for Hadoop,
   and as operating systems are resource managers, YARN manages the resources of
   Hadoop so that Hadoop serves big data in a better way.
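As a rough sketch added for illustration (not taken from the assignment), the
snippet below names the actual Hadoop configuration keys behind the block size,
replication factor, and YARN scheduler discussed above, set here through the Java
Configuration API. On a real cluster these properties normally live in hdfs-site.xml
and yarn-site.xml, and the values and NameNode address shown are only example
assumptions.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // HDFS block size in bytes: 128 MB, the default in Hadoop 2 and later.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // Number of copies of each block kept in the cluster (3 is the default).
        conf.setInt("dfs.replication", 3);

        // The YARN scheduler is a ResourceManager-side setting (yarn-site.xml);
        // the key is shown here only to name it. Capacity Scheduler is one option,
        // the Fair Scheduler is another.
        conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");

        // Files written through a FileSystem created from this Configuration will
        // use the block size and replication factor set above.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // hypothetical address
        System.out.println("Default block size: " + fs.getDefaultBlockSize(new Path("/")) + " bytes");
        fs.close();
    }
}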
Hadoop Versions: So far there are three versions of Hadoop, as follows.
 Hadoop 1: This is the first and most basic version of Hadoop. It includes Hadoop
  Common, the Hadoop Distributed File System (HDFS), and MapReduce.

 Hadoop 2: The main difference between Hadoop 1 and Hadoop 2 is that Hadoop 2
  additionally contains YARN (Yet Another Resource Negotiator). YARN handles
  resource management and task scheduling through its daemons: a global Resource
  Manager, per-node Node Managers, and a per-application Application Master that
  takes care of job tracking and progress monitoring.

 Hadoop 3: This is the most recent version of Hadoop. Along with the merits of the
  first two versions, Hadoop 3 has one very important additional merit: it has
  resolved the issue of a single point of failure by supporting multiple Name Nodes.
  Various other advantages, such as erasure coding and support for GPU hardware and
  Docker, make it superior to the earlier versions of Hadoop.
Conclusion
To conclude, Hadoop is a powerful open-source software framework that enables
the storage and processing of large amounts of data in a distributed computing
environment. Its components, including the Hadoop Distributed File System
(HDFS), MapReduce, and YARN, work together to provide efficient data storage,
processing, and resource management. Hadoop's flexibility and scalability make it
suitable for various big data applications, ranging from data warehousing to
machine learning. Additionally, Hadoop's history traces back to Google's early work
on the Google File System and MapReduce, and it has since been developed and
maintained by the Apache Software Foundation. With its widespread adoption and
use in industry, Hadoop has emerged as a critical technology for handling and
analyzing big data.
