Unit 2
Hadoop Ecosystem
Overview: Apache Hadoop is an open-source framework intended to make working with big data easier. For those not acquainted with the technology, the first question is: what is big data? Big data is a term for data sets that cannot be processed efficiently with traditional methodology such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large, sensitive data sets requiring efficient handling. Hadoop is a framework that enables processing of large data sets that reside across clusters of machines. Being a framework, Hadoop is made up of several modules supported by a large ecosystem of technologies.
All these toolkits and components revolve around one thing: data. That is the beauty of Hadoop; everything is organized around data, which makes its processing and analysis easier.
HDFS:
1. Name Node
2. Data Node
Name Node is the prime node; it holds the metadata (data about data) and therefore requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes run on commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
HDFS maintains all the coordination between the clusters and the hardware, and thus works at the heart of the system.
YARN:
Yet Another Resource Negotiator. As the name implies, YARN helps manage the resources across the clusters; in short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
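As a rough illustration (not taken from the source material), the sketch below uses YARN's Java client API to ask the Resource Manager for a report on every running Node Manager and the resources it offers; the cluster address is assumed to come from a yarn-site.xml on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
    public static void main(String[] args) throws Exception {
        // Reads the Resource Manager address from yarn-site.xml on the classpath
        // (an assumption of this sketch; adjust the configuration for your cluster).
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for a report on every running Node Manager.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId()
                    + " capability=" + node.getCapability()   // e.g. <memory:8192, vCores:8>
                    + " containers=" + node.getNumContainers());
        }

        yarnClient.stop();
    }
}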
MapReduce:
MapReduce is the processing layer of Hadoop. Using parallel, distributed processing over the cluster, it carries out a job in two phases: a Map phase, which converts the input data into intermediate key-value pairs, and a Reduce phase, which aggregates those pairs into the final result. Both phases are described in more detail later in this unit.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig executes the commands, and in the background all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.
Pig helps achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Like other query-processing frameworks, HIVE comes with two main components: the JDBC drivers and the HIVE command line.
The JDBC and ODBC drivers establish the connection and data storage permissions, whereas the HIVE command line helps in the processing of queries.
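A minimal sketch of the JDBC side is given below; the HiveServer2 host and port, the default database, the hiveuser account, and the employees table are all assumptions used for illustration, not values from the text.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (requires the hive-jdbc jar on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port (10000 is the usual default) and the
        // "default" database are assumptions of this sketch.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             // The employees table is a hypothetical example table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + "\t" + rs.getDouble("salary"));
            }
        }
    }
}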
Hadoop – Pros and Cons
Big Data has become important as industries grow; the goal is to gather information and find the hidden facts behind the data. Data defines how industries can improve their activities and affairs. A large number of industries revolve around data, and a huge amount of data is gathered and analyzed through various processes with various tools. Hadoop is one of the tools used to deal with this huge amount of data, as it can easily extract information from it. Like any tool, Hadoop has its advantages and disadvantages when dealing with Big Data.
Cons
1. Problem with Small Files: Hadoop performs efficiently over a small number of large files. It stores files in the form of blocks, which range from 128 MB (the default) to 256 MB in size. Hadoop struggles when it has to access a large number of small files; so many small files overload the NameNode and make it difficult to work with.
2. Vulnerability: Hadoop is a framework written in Java, one of the most commonly used programming languages, which makes it more exposed, since its weaknesses are widely known and can be exploited by cyber-criminals.
3. Low Performance on Small Data: Hadoop is mainly designed for dealing with large datasets, so it can be used efficiently by organizations that generate a massive volume of data. Its efficiency decreases when it is used on small amounts of data.
4. Lack of Security: Data is everything for an organization, yet the security features in Hadoop are disabled by default, so whoever manages the data needs to be aware of this and take appropriate action. Hadoop uses Kerberos for security, which is not easy to manage, and Kerberos does not provide storage or network encryption, which is a further concern.
5. High Processing Overhead: Read/write operations in Hadoop are costly, since we are dealing with data measured in terabytes or petabytes. In Hadoop, data is read from and written to disk, which makes in-memory computation difficult and leads to processing overhead.
6. Supports Only Batch Processing: A batch process runs in the background and has no interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, so producing output with low latency is not possible.
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size; for example, a 5 MB file stored in HDFS with a block size of 128 MB takes only 5 MB of space. The HDFS block size is large simply to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern, where the Name Node acts as the master. The Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the Name Node, allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. File system operations like opening, closing, and renaming are executed by it.
3. Data Node: Data Nodes store and retrieve blocks when they are told to, by the client or the Name Node. They report back to the Name Node periodically with the list of blocks they are storing. The Data Nodes, being commodity hardware, also do the work of block creation, deletion, and replication as directed by the Name Node.
Since all the metadata is stored in the Name Node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the Data Nodes. To overcome this, the concept of the Secondary Name Node arises.
Secondary Name Node: It is a separate physical machine that acts as a helper to the Name Node. It performs periodic checkpoints: it communicates with the Name Node and takes snapshots of the metadata, which helps minimize downtime and loss of data.
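To see how a client interacts with the Name Node and Data Nodes, here is a minimal sketch using the HDFS Java API; the Name Node address hdfs://namenode:9000 and the path /data/sample.txt are placeholders, not values from the text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sample.txt");         // assumed file path
        FileStatus status = fs.getFileStatus(file);

        // The NameNode returns metadata only; the block data stays on the DataNodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}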
How Hadoop Processes Data
1. **Storage:**
- **Hadoop Distributed File System (HDFS):** Hadoop stores data in a distributed file system called
HDFS. Data is divided into blocks and distributed across multiple nodes in the cluster for fault
tolerance and parallel processing.
2. **Processing Model:**
- **MapReduce:** The core processing model in Hadoop is MapReduce. It consists of two main
phases - the Map phase and the Reduce phase.
- **Map Phase:** Input data is divided into smaller chunks, and a "mapper" task processes each
chunk independently, generating key-value pairs as output.
- **Shuffle and Sort:** The framework then shuffles and sorts the intermediate key-value pairs,
grouping them by key.
- **Reduce Phase:** The "reducer" tasks process the sorted key-value pairs, aggregating and
producing the final output.
3. **Programming Model:**
- **MapReduce API:** Developers can write MapReduce programs using the Hadoop MapReduce API, typically in Java (see the WordCount sketch after this list). There are also higher-level abstractions and languages available, such as Apache Pig, Apache Hive, and Apache Spark, which simplify the development process.
4. **Job Submission:**
- **Hadoop Job Submission:** Once the MapReduce program is written, it needs to be packaged into a JAR file and submitted to the Hadoop cluster, where the JobTracker (or, in YARN-based clusters, the ResourceManager) manages the execution of the job.
5. **Query and Analysis:**
- **Apache Hive:** Provides a SQL-like interface for querying and managing large datasets stored in Hadoop.
- **Apache Pig:** A high-level scripting language for expressing data analysis programs that are
translated into MapReduce jobs.
6. **Monitoring and Management:**
- **Hadoop Ecosystem Tools:** Various tools such as Apache Ambari, Apache Hadoop YARN, and Apache Hadoop MapReduce provide monitoring, resource management, and job tracking capabilities.
7. **Data Ingestion and Export:**
- **Data Ingestion:** Hadoop can ingest data from various sources, and tools like Apache Flume or Apache Kafka can be used for efficient data ingestion.
- **Data Export:** Data processed in Hadoop can be exported to other systems or databases for further analysis or reporting.
8. **Scaling:**
- **Horizontal Scaling:** Hadoop allows for easy scaling by adding more nodes to the cluster,
providing the ability to handle increasing amounts of data and processing requirements.
Keep in mind that Hadoop has evolved over time and new technologies continue to emerge; always refer to the latest documentation for the most up-to-date information.
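The WordCount sketch below ties the Storage, Processing Model, and Job Submission steps together. It is the classic introductory example written against the Hadoop MapReduce Java API; the input and output paths are supplied on the command line and are assumptions of this sketch. Packaged into a JAR and submitted with the hadoop jar command, the job is executed across the cluster as described above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle-and-sort, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}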
The Mapper class takes the input, tokenizes it, and maps and sorts it. The output of the Mapper class is used as input by the Reducer class, which in turn searches for matching pairs and reduces them. MapReduce implements a number of algorithmic techniques, including the following:
Sorting
Searching
Indexing
TF-IDF
Searching
Example
The Map phase processes each input file and provides the employee data in key-value pairs (<k, v> : <emp name, salary>).
The combiner phase (searching technique) accepts the input from the Map phase as key-value pairs of employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = the salary of the first employee. Treated as max salary.
if (v(second employee).salary > Max) {
    Max = v(salary);
}
else {
    Continue checking;
}
Reducer phase − From each file, you will find the highest salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs, which come from four input files. The final output should be as follows −
<gopal, 50000>
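To connect this with the MapReduce API, here is a hedged Java sketch of the same searching technique. It assumes each input line has the form name,salary (an input format not specified in the source) and follows the same driver pattern as the WordCount sketch shown earlier.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxSalary {

    // Map phase: emit every "name,salary" line under a single key so that all
    // candidates are compared by the combiner and reducer.
    public static class SalaryMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text MAX_KEY = new Text("max");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(MAX_KEY, value);
        }
    }

    // Combiner (searching technique): within each map task, keep only the
    // highest-salaried entry and re-emit it under the same key.
    public static class MaxCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(best(values)));
        }
    }

    // Reduce phase: compare the per-file maxima and emit the overall winner
    // as <employee name, salary>, e.g. <gopal, 50000>.
    public static class MaxReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String[] parts = best(values).split(",");
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }

    // Shared helper: scan "name,salary" strings and return the highest-paid one.
    private static String best(Iterable<Text> values) {
        String bestName = null;
        long bestSalary = Long.MIN_VALUE;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            long salary = Long.parseLong(parts[1].trim());
            if (salary > bestSalary) {   // the same check as in the pseudocode above
                bestSalary = salary;
                bestName = parts[0].trim();
            }
        }
        return bestName + "," + bestSalary;
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max salary");
        job.setJarByClass(MaxSalary.class);
        job.setMapperClass(SalaryMapper.class);
        job.setCombinerClass(MaxCombiner.class);
        job.setReducerClass(MaxReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}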
Indexing
The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
The inverted index generated from this input is:
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file.
Similarly, "is": {0, 1, 2} implies the term "is" appears in the files
T[0], T[1], and T[2].
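A compact Java sketch of the map and reduce sides of inverted indexing follows. The job driver mirrors the WordCount example above and is omitted; treating each input file's name as its document identifier is an assumption of this sketch.

import java.io.IOException;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Map phase: emit (term, file name) for every term in the split being processed.
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new Text(fileName));
            }
        }
    }

    // Reduce phase: collect the distinct files in which each term appears,
    // producing postings such as "is" -> {T[0], T[1], T[2]}.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> files = new TreeSet<>();
            for (Text value : values) {
                files.add(value.toString());
            }
            context.write(key, new Text(files.toString()));
        }
    }
}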
TF-IDF
TF-IDF is short for Term Frequency - Inverse Document Frequency; it weighs how important a term is to a document in a collection.
Term Frequency (TF)
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF)
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it). In the example below, a base-10 logarithm is used.
Example
Consider a document containing 1000 words, wherein the
word hive appears 50 times. The TF for hive is then (50 / 1000) =
0.05.
Now, assume we have 10 million documents and the word hive appears in 1,000 of these. Then, the IDF is calculated as log(10,000,000 / 1,000) = 4. The TF-IDF weight is the product of these quantities: 0.05 × 4 = 0.20.
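The arithmetic in this example can be reproduced in a few lines of Java; the figures are the ones used above, and the class name is only illustrative.

public class TfIdfExample {
    public static void main(String[] args) {
        // Term frequency: the word "hive" appears 50 times in a 1,000-word document.
        double tf = 50.0 / 1000.0;                        // 0.05

        // Inverse document frequency: 10 million documents, "hive" appears in 1,000 of them.
        double idf = Math.log10(10_000_000.0 / 1_000.0);  // 4.0

        // The TF-IDF weight is the product of the two quantities.
        System.out.println("TF-IDF(hive) = " + (tf * idf)); // prints 0.2
    }
}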