Hadoop Ecosystem
Hadoop Ecosystem
Hadoop Ecosystem
Introduction
Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems. It includes Apache projects and various commercial tools and solutions. There
are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common.
Most of the tools or solutions are used to supplement or support these major elements. All
these tools work collectively to provide services such as absorption, analysis, storage and
maintenance of data etc.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
Name Node is the prime node which contains metadata (data about data) requiring
comparatively fewer resources than the data nodes that stores the actual data. These data
nodes are commodity hardware in the distributed environment. Undoubtedly, making
Hadoop cost effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop System.
Consists of three major components i.e.
1. Resource Manager
2. Nodes Manager
3. Application Manager
Resource manager has the privilege of allocating resources for the applications in a system
whereas Node managers work on the allocation of resources such as CPU, memory,
bandwidth per machine and later on acknowledges the resource manager. Application
manager works as an interface between the resource manager and node manager and
performs negotiations as per the requirement of the two.
MapReduce:
By making the use of distributed and parallel algorithms, MapReduce makes it possible to
carry over the processing’s logic and helps to write applications which transform big data
sets into a manageable one.
MapReduce makes the use of two functions i.e. Map() and Reduce() whose task is:
1. Map() performs sorting and filtering of data and thereby organizing them in the form of
group. Map generates a key-value pair based result which is later on processed by the
Reduce() method.
2. Reduce(), as the name suggests does the summarization by aggregating the mapped
data. In simple, Reduce() takes the output generated by Map() as input and combines
those tuples into smaller set of tuples.
PIG:
Pig was basically developed by Yahoo which works on a pig Latin language, which is Query
based language similar to SQL.
It is a platform for structuring the data flow, processing and analyzing huge data sets.
Pig does the work of executing commands and in the background, all the activities of
MapReduce are taken care of. After the processing, pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig Runtime. Just
the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop Ecosystem.
HIVE:
With the help of SQL methodology and interface, HIVE performs reading and writing of
large data sets. However, its query language is called as HQL (Hive Query Language).
It is highly scalable as it allows real-time processing and batch processing both. Also, all the
SQL datatypes are supported by Hive thus, making the query processing easier.
Similar to the Query Processing frameworks, HIVE too comes with two components: JDBC
Drivers and HIVE Command Line.
JDBC, along with ODBC drivers work on establishing the data storage permissions and
connection whereas HIVE Command line helps in the processing of queries.
Mahout:
Mahout, allows Machine Learnability to a system or application. Machine Learning, as the
name suggests helps the system to develop itself based on some patterns,
user/environmental interaction or on the basis of algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and
classification which are nothing but concepts of Machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
Apache Spark:
It’s a platform that handles all the process consumptive tasks like batch processing,
interactive or iterative real-time processing, graph conversions, and visualization, etc.
It consumes in memory resources hence, thus being faster than the prior in terms of
optimization.
Spark is best suited for real-time data whereas Hadoop is best suited for structured data or
batch processing, hence both are used in most of the companies interchangeably.
Apache HBase:
It’s a NoSQL database which supports all kinds of data and thus capable of handling
anything of Hadoop Database. It provides capabilities of Google’s BigTable, thus able to
work on Big Data sets effectively.
At times where we need to search or retrieve the occurrences of something small in a huge
database, the request must be processed within a short quick span of time. At such times,
HBase comes handy as it gives us a tolerant way of storing limited data
Features of Hadoop:
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source project the
source-code is available online for anyone to understand it or make some modifications as
per their industry requirement.
Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware which provides a cost-
efficient model, unlike traditional Relational databases that require expensive hardware and
high-end processors to deal with Big Data. The problem with traditional Relational databases
is that storing the Massive volume of data is not cost-effective, so the company’s started to
remove the Raw data. which may not result in the correct scenario of their business. Means
Hadoop provides us 2 main benefits with the cost one is it’s open-source means free to use
and the other is that it uses commodity hardware which is also inexpensive.
Hadoop is designed in such a way that it can deal with any kind of dataset like
structured(MySql Data), Semi-Structured(XML, JSON), Un-structured (Images and Videos) very
efficiently. This means it can easily process any kind of data independent of its structure
which makes it highly flexible. It is very much useful for enterprises as they can process large
datasets easily, so the businesses can use Hadoop to analyze valuable insights of data from
sources like social media, email, etc. With this flexibility, Hadoop can be used with log
processing, Data Warehousing, Fraud detection, etc.
Easy to Use:
Hadoop is easy to use since the developers need not worry about any of the processing work
since it is managed by the Hadoop itself. Hadoop ecosystem is also very large comes up with
lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.
Hadoop uses Data Locality:
The concept of Data Locality is used to make Hadoop processing fast. In the data locality
concept, the computation logic is moved near data rather than moving the data to the
computation logic. The cost of Moving data on HDFS is costliest and with the help of the data
locality concept, the bandwidth utilization in the system is minimized.
Provides Faster Data Processing:
Hadoop uses a distributed file system to manage its storage i.e. HDFS(Hadoop Distributed File
System). In DFS(Distributed File System) a large size file is broken into small size file blocks
then distributed among the Nodes available in a Hadoop cluster, as this massive number of
file blocks are processed parallelly which makes Hadoop faster, because of which it provides a
High-level performance as compared to the traditional DataBase Management Systems.
Advantages of Hadoop:
1. Hadoop is a highly scalable storage platform. Hence, it can store and distribute a huge
amount of data sets across hundreds of inexpensive servers.
2. Hadoop provides a cost-effective storage solution for businesses to exploding data sets.
3. Hadoop allows businesses to easily access new data sources and tap into various types of
data to generate value from that data. Hence, Hadoop derives valuable business insights
from data sources such as social media, email conversations.
4. Hadoop can be used for a wide range of purposes, including log processing, data
warehousing, consumer strategy analysis, and fraud detection.
5. Hadoop can handle unstructured as well as semi-structured data.
6. The main advantage of Hadoop is its fault tolerance. When data is sent to a specific node, the
data is also distributed to other nodes in the network, ensuring there is another copy
available for use in the event of failure.
7. Hadoop framework has built-in power and flexibility to do what not possible earlier.
8. The addition of more nodes to the Hadoop cluster provides more storage and computing
power. This feature eliminates the need to buy external hardware. Hence, it is a cheaper
solution.
9. The unique storage method of Hadoop is based on a distributed file system that effectively
maps data wherever the cluster is located. The data analysis devices are also on the same
servers where the data is located, resulting in much quicker processing of the data.
10. Hadoop helps in distributing data on different servers and must be prevented network
overloading.
Disadvantages of Hadoop
1. Hadoop is complex applications and it difficult to manage. The security of Hadoop is the main
concern, which is disabled by default due to sheer complexity. If whoever managing the
platform lacks to know how to enable it, your data could be a huge risk.
2. Talking about security, the own makeup of Hadoop makes it a dangerous proposition to
manage. The framework is written almost in Java which has been heavily exploited by
cybercriminals.
3. Hadoop does not have storage or network-level encryption.
4. Whenever, Hadoop operated by a single master it will cause difficulty in scaling.
5. Hadoop is not suitable for small and real-time data applications.
6. The Hadoop distributed file system lacks the ability to efficiently support the random reading
of small files, due to its high capacity design. Thus, it is not recommended for organizations
with small quantities of data.
7. Hadoop has had its fair share of stability issues like all open-source software. The
organizations are strongly recommended to make sure they are running the latest stable
version to avoid these issues.
8. The Apache Flume, Google’s own cloud dataflow are potential solutions and the ability to
enhance data collection, processing, and integration performance and reliability. In addition,
many organizations missing out on big benefits by using Hadoop alone.
9. The programming model of Hadoop is very restrictive.
10. Hadoop is a built-in redundancy duplicates data, therefore requiring more storage resources.
Uses of Hadoop:
1. It is used to detect and prevent cyber-attacks.
2. A financial company uses this technique to search the customers.
3. The most important uses are in customer requirement understanding.
4. Hadoop is used to developed cities and countries. In addition, it also gives proper guidelines
for buses, trains, and another way of transaction.
5. Hadoop is also used in the high frequency trading field. Many trading decisions are taken by
algorithm only.
6. It is used in understanding and optimizing business process.
7. Hadoop has various uses such as HDFS, Pig, Hive, Hbase, Spark, and Scala.
8. It can be used with languages like Java, Python, and Shell Scripting.
9. Hadoop has got uses like Big Data storage, Business Intelligence, Data Mining, and Analytics.