BDA Unit 3
The Hadoop ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools and solutions are used to supplement or support these major elements. All of these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing: data. That is the beauty of Hadoop; everything is organized around the data itself, which makes processing and analysis easier.
HDFS:
● HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
● HDFS consists of two core components i.e.
1. Name node
2. Data Node
● The Name Node is the prime node; it contains the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes run on commodity hardware in the distributed environment, which is what makes Hadoop cost effective.
● HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.
YARN:
● Yet Another Resource Negotiator (YARN), as the name implies, helps manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
● It consists of three major components i.e.
1. Resource Manager
2. Node Manager
3. Application Master
● The Resource Manager has the privilege of allocating resources to the applications in the system, whereas the Node Managers handle the allocation of resources such as CPU, memory and bandwidth on each machine and later report back to the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
● By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps write applications that transform big data sets into manageable ones.
● MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map generates key-value pairs that are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples (see the sketch below).
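As a concrete illustration, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API. The class names (TokenizerMapper, IntSumReducer) are illustrative assumptions, not part of the notes above.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map(): splits each input line into words and emits (word, 1) key-value pairs
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);       // key-value pair handed to the shuffle
            }
        }
    }

    // Reduce(): aggregates all values that share the same key into one summarized tuple
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // one (word, count) pair per key
        }
    }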
PIG:
Pig was originally developed by Yahoo. It works with Pig Latin, a query-based language similar to SQL.
● It is a platform for structuring the data flow, processing and analyzing huge data sets.
● Pig does the work of executing commands and, in the background, all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
● The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
● Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop Ecosystem.
HIVE:
● With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
● It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
● Similar to other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
● JDBC, along with ODBC drivers, handles data-storage permissions and connection establishment, whereas the Hive command line helps in the processing of queries (a small JDBC sketch follows below).
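As a rough sketch of the JDBC path mentioned above, the following Java fragment connects to a HiveServer2 instance and runs an HQL query. The host, port, credentials and the "sales" table are assumed values for illustration only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver; the connection URL below assumes a local HiveServer2
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = con.createStatement();
                 // HQL query against a hypothetical "sales" table
                 ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }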
Mahout:
● Mahout brings machine-learning capability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environment interaction or algorithms.
● It provides libraries for collaborative filtering, clustering, and classification, which are core machine-learning techniques, and it allows these algorithms to be invoked as needed through its own libraries.
Apache Spark:
● It is a platform that handles computation-heavy tasks such as batch processing, interactive or iterative real-time processing, graph processing, and visualization.
● It uses in-memory resources and is therefore faster than MapReduce in terms of optimization.
● Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for structured data and batch processing; hence both are used in most companies, each where it fits best.
Apache HBase:
● It is a NoSQL database that supports all kinds of data and is thus capable of handling any kind of Hadoop workload. It provides the capabilities of Google's BigTable and can therefore work on big data sets effectively.
● At times we need to search or retrieve a few small records from a huge database, and the request must be processed within a very short span of time. In such cases HBase comes in handy, as it gives us a fault-tolerant way of storing and looking up such data.
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
● Solr, Lucene: These are two services that perform searching and indexing with the help of Java libraries. Lucene is a Java library that also provides a spell-check mechanism, and Solr is built on top of Lucene.
● Zookeeper: There used to be a huge problem managing coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance.
● Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
Parallel Computation Framework: MapReduce:
MapReduce is a processing technique built on the divide-and-conquer approach. It is made up of two different tasks, Map and Reduce. Map breaks individual elements into tuples (key-value pairs), while Reduce collects and combines the output from the Map task and writes out the aggregated result.
What is MapReduce?
MapReduce is the processing engine of Apache Hadoop, directly derived from Google's MapReduce. MapReduce applications are written basically in Java. It conveniently computes huge amounts of data by applying mapping and reducing steps in order to come up with a solution for the required problem. The mapping step takes
a set of data in order to convert it into another set of data by breaking the individual elements
into key/value pairs called tuples. The second step of reducing takes the output derived from
the mapping process and combines the data tuples into a smaller set of tuples.
MapReduce is a hugely parallel processing framework that can be easily scaled over massive amounts of commodity hardware to meet the increased need for processing larger amounts of data. Once you get the mapping and reducing tasks right, all it needs is a change in configuration to make it work on a larger set of data. This kind of extreme scalability, from a single node to hundreds and even thousands of nodes, is what makes MapReduce a top favorite among big data professionals worldwide.
MapReduce can also be integrated with SQL-like tools to facilitate parallel query processing.
MapReduce Architecture
The entire MapReduce process is a massively parallel processing setup where the computation is moved to the place of the data instead of moving the data to the place of the computation. This approach helps speed up the process, reduces network congestion and improves the efficiency of the overall process.
The entire computation process is broken down into the mapping, shuffling and reducing
stages.
Mapping Stage: This is the first step of MapReduce and it includes the process of reading the information from the Hadoop Distributed File System (HDFS). The data could be in the form of a directory or a file. The input data file is fed into the mapper function one line at a time. The mapper then processes the data and breaks it down into smaller blocks of intermediate data.
Reducing Stage: The reducer phase can consist of multiple processes. In the shuffling process, the data is transferred from the mapper to the reducer; without successful shuffling of the data, there would be no input to the reducer phase. The shuffling process can start even before the mapping process has completed. Next, the data is sorted in order to lower the time taken to reduce the data. The sorting actually helps the reducing process by providing a cue when the next key in the sorted input data is distinct from the previous key. The reduce task needs a specific key-value pair in order to call the reduce function that takes the key-value pair as its input. The output from the reducer can be stored directly in HDFS.
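A minimal driver sketch showing how the mapping, shuffling and reducing stages are wired together with the standard Hadoop Job API is given below. It assumes mapper and reducer classes like the TokenizerMapper/IntSumReducer sketched earlier, and the input/output paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);   // mapping stage: reads HDFS input line by line
            job.setCombinerClass(IntSumReducer.class);   // optional local aggregation before the shuffle
            job.setReducerClass(IntSumReducer.class);    // reducing stage: consumes shuffled, sorted keys
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // assumed path
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // assumed path
            System.exit(job.waitForCompletion(true) ? 0 : 1);  // output is written back to HDFS
        }
    }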
MapReduce Terminologies
● MasterNode – Place where JobTracker runs and which accepts job requests from
clients
● SlaveNode – It is the place where the mapping and reducing programs are run
● JobTracker – it is the entity that schedules the jobs and tracks the jobs assigned using
Task Tracker
● TaskTracker – It is the entity that actually tracks the tasks and provides the report
status to the JobTracker
● Job – A MapReduce job is the execution of the Mapper & Reducer program across a
dataset
● Task – the execution of the Mapper & Reducer program on a specific data section
● TaskAttempt – A particular task execution attempt on a SlaveNode
MapReduce is a programming model and processing engine designed for large-scale data
processing. Originally developed by Google, it has been widely adopted in the industry, and
there have been several improvements and extensions to the MapReduce framework to
address its limitations and enhance its capabilities. Some of the key improvements include:
Performance Optimization:
Parallelization: Efforts have been made to improve the parallel processing capabilities of
MapReduce. This includes optimizing the scheduling and execution of tasks to make better
use of available resources.
Data Locality: Enhancements have been made to increase data locality, ensuring that
computation is performed on nodes where the data resides, reducing the need for data transfer
over the network.
Resource Management:
YARN (Yet Another Resource Negotiator): Apache Hadoop 2.x introduced YARN, a resource
manager that allows different processing engines to share resources on a Hadoop cluster. This
allows for more efficient resource utilization and better support for multi-stage data
processing workflows.
Fault Tolerance:
Job Recovery: MapReduce frameworks have become more robust in handling node failures
and job recovery. Checkpointing mechanisms and fault-tolerant strategies are employed to
ensure that jobs can recover from failures without starting from scratch.
Programming Abstractions:
Higher-Level APIs: Higher-level abstractions and APIs have been developed to simplify the
development of MapReduce applications. Libraries like Apache Pig and Apache Hive
provide more declarative languages and abstractions, making it easier for developers to
express complex data processing tasks.
Ease of Use:
Apache Hadoop Ecosystem: The Hadoop ecosystem has expanded to include various tools
and frameworks that work seamlessly with MapReduce, making it easier for users to build
end-to-end data processing pipelines. For example, Apache Spark provides a more expressive
and user-friendly API for distributed data processing.
Real-time Processing:
Apache Flink and Apache Storm: For scenarios requiring low-latency processing, other
frameworks like Apache Flink and Apache Storm have gained popularity. These frameworks
enable real-time stream processing in addition to batch processing.
Tez and Spark: Frameworks like Apache Tez and Apache Spark have been developed to
provide more optimized execution engines for certain types of workloads, offering
improvements in terms of performance and flexibility.
Further areas of ongoing improvement include dynamic scaling and security enhancements.
Cloud-Native Solutions: MapReduce frameworks have been adapted for cloud computing
environments, with better integration with cloud services. This includes managed services on
cloud platforms like Amazon EMR, Google Cloud Dataproc, and Azure HDInsight.
MapReduce, a parallel programming model popularized by Google and widely implemented
in distributed computing frameworks such as Apache Hadoop, has become a cornerstone in
the processing of large-scale data. However, the efficiency of MapReduce is contingent upon
effective task scheduling and load balancing. Task scheduling refers to the allocation of
computing resources for executing individual tasks, while load balancing ensures that these
tasks are evenly distributed across the available nodes in a cluster. Optimizing these aspects is
crucial for achieving high-performance data processing in distributed environments.
One of the primary optimizations in task scheduling involves maximizing data locality. By
placing tasks on nodes where the required data resides, the need for data transfer over the
network is minimized, reducing latency. Modern MapReduce frameworks, including Apache
Hadoop's YARN, emphasize intelligent task placement strategies to enhance data locality.
Speculative Execution:
To combat stragglers — tasks that take longer to complete than expected — speculative
execution is employed. The framework identifies slow-performing tasks and launches backup
copies on other nodes. The first completed instance is then used, mitigating the impact of
slow nodes on overall job completion time.
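Speculative execution can be switched on or off per job through standard Hadoop configuration properties. The snippet below is a small sketch of doing so in Java; the job name is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeConfig {
        public static Job buildJob() throws Exception {
            Configuration conf = new Configuration();
            // Launch backup copies of slow map/reduce tasks (both properties default to true)
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);
            return Job.getInstance(conf, "job with speculative execution");
        }
    }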
Priority Scheduling:
Prioritizing tasks based on their importance and urgency is crucial for meeting different
application requirements. MapReduce frameworks often support priority scheduling,
allowing critical tasks to be scheduled with higher priority, ensuring their timely execution.
Effective load balancing starts with appropriate task granularity and split size. Optimizing
these parameters ensures that each task is of a manageable size, preventing imbalances
caused by small or overly large tasks. Fine-tuning these aspects contributes to a more uniform
distribution of computation across the cluster.
Static load balancing strategies may not adapt well to changing workloads. Dynamic load
balancing mechanisms continuously monitor the performance of nodes and redistribute tasks
based on real-time metrics. This adaptability ensures efficient resource utilization even as the
workload fluctuates.
Centralized vs. Decentralized Schedulers:
The choice between centralized and decentralized schedulers impacts load balancing.
Centralized schedulers, like those in Hadoop 1.x, may encounter bottlenecks, whereas
decentralized schedulers, as seen in YARN, distribute scheduling decisions across multiple
components, enhancing scalability and load balancing.
Task Migration:
In scenarios where imbalances persist, task migration can be employed. This involves
transferring tasks from heavily loaded nodes to underutilized ones, redistributing the
computational load and maintaining a more equitable distribution of resources.
Earlier, Hadoop supported a single scheduler that was intermixed with the JobTracker logic. For the traditional batch jobs of Hadoop (such as log mining and web indexing), this implementation was adequate, yet it was inflexible and impossible to tailor.
Previous versions of Hadoop had a very simple way of scheduling user jobs: with the Hadoop FIFO scheduler, they ran in order of submission. Using the mapred.job.priority property or the setJobPriority() method on JobClient adds the ability to set a job's priority (a small configuration sketch follows below). The job scheduler selects the job with the highest priority when choosing the next job to run. However, priorities do not support preemption with the FIFO scheduler in Hadoop, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.
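As a small, hedged illustration of setting a job's priority through configuration, the snippet below uses the legacy property name quoted above; newer MapReduce releases use mapreduce.job.priority, and the job name is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PriorityExample {
        public static Job buildHighPriorityJob() throws Exception {
            Configuration conf = new Configuration();
            // Legacy property from the text; mapreduce.job.priority is the newer equivalent.
            // Accepted values: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
            conf.set("mapred.job.priority", "HIGH");
            return Job.getInstance(conf, "high-priority job");   // placeholder job name
        }
    }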
Additionally, MapReduce in Hadoop comes with a choice of schedulers: the Hadoop FIFO scheduler, and multiuser schedulers such as the Hadoop Fair Scheduler and the Hadoop Capacity Scheduler.
There are several types of schedulers which we use in Hadoop, such as:
Types of Hadoop Schedulers
a. Hadoop FIFO Scheduler
The original Hadoop job scheduling algorithm, integrated within the JobTracker, is FIFO. As a process, the JobTracker pulled jobs from a work queue, oldest job first; this is Hadoop FIFO scheduling. It is a simpler and efficient approach, but it has no concept of the priority or size of the job.
b. Hadoop Fair Scheduler
Further, to give every user a fair share of the cluster capacity over time, we use the Fair Scheduler in Hadoop. A single running job gets the whole Hadoop cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.
If a pool has not received its fair share for a certain period of time, the Hadoop Fair Scheduler supports preemption: the scheduler will kill tasks in pools running over capacity in order to give the slots to pools running under capacity. The Fair Scheduler implementation class is org.apache.hadoop.mapred.FairScheduler.
c. Hadoop Capacity Scheduler
The Capacity Scheduler is like the Fair Scheduler, except that within each queue jobs are scheduled using FIFO scheduling (with priorities). It takes a slightly different approach to multiuser scheduling: for each user or organization, it permits simulating a separate MapReduce cluster with FIFO scheduling.
Apart from schedulers, Hadoop also offers the concept of provisioning virtual clusters from within larger physical clusters, which we call Hadoop On Demand (HOD). It uses the Torque resource manager for node allocation on the basis of the requirements of the virtual cluster. After preparing configuration files, the HOD system automatically initializes the system on the allocated nodes within the virtual cluster. After initialization, the HOD virtual cluster can be used in a relatively independent way.
In other words, HOD is an interesting model for deploying Hadoop clusters within a cloud infrastructure. It offers greater security as an advantage, since nodes are shared less.
When to Use Each Scheduler in Hadoop?
So, we conclude that the Capacity Scheduler is the right choice when we are running a large Hadoop cluster with multiple clients and want to ensure guaranteed access, with the potential to reuse unused capacity and to prioritize jobs within queues.
The Fair Scheduler works well when both small and large clusters are used by the same organization with a limited number of workloads. In a simpler and less configurable way, it offers a means to non-uniformly distribute capacity to pools (of jobs). Furthermore, it can offer fast response times for small jobs mixed with larger jobs (supporting more interactive use models). Hence, it is useful in the presence of diverse jobs.
In Hadoop, we can receive multiple jobs from different clients to perform. The Map-Reduce
framework is used to perform multiple tasks in parallel in a typical Hadoop cluster to process
large size datasets at a fast rate. This Map-Reduce Framework is responsible for scheduling
and monitoring the tasks given by different clients in a Hadoop cluster. But this method of
scheduling jobs is used prior to Hadoop 2.
Now in Hadoop 2, we have YARN (Yet Another Resource Negotiator). In YARN we have
separate Daemons for performing Job scheduling, Monitoring, and Resource Management as
Application Master, Node Manager, and Resource Manager respectively.
Here, Resource Manager is the Master Daemon responsible for tracking or providing the
resources required by any application within the cluster, and Node Manager is the slave
Daemon which monitors and keeps track of the resources used by an application and sends
the feedback to Resource Manager.
The Scheduler and the Applications Manager are the two major components of the Resource Manager.
The Scheduler in YARN is dedicated purely to scheduling jobs; it does not track the status of applications. The Scheduler schedules jobs on the basis of the resources they require.
There are mainly 3 types of Schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler.
2. Capacity Scheduler.
3. Fair Scheduler.
These schedulers are essentially algorithms that we use to schedule tasks in a Hadoop cluster when we receive requests from different clients (the scheduler in use is selected through a configuration property, as sketched below).
A job queue is nothing but the collection of various tasks that we have received from our various clients. The tasks are available in the queue and we need to schedule these tasks on the basis of our requirements.
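Which of these schedulers the ResourceManager uses is controlled by a single property, normally set in yarn-site.xml. The Java sketch below only illustrates the property name and the standard scheduler class names; choosing the Capacity Scheduler here is purely an example.

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerSelection {
        public static Configuration chooseScheduler() {
            Configuration conf = new Configuration();
            // yarn.resourcemanager.scheduler.class selects the scheduler implementation:
            //   ...scheduler.fifo.FifoScheduler         - FIFO Scheduler
            //   ...scheduler.capacity.CapacityScheduler - Capacity Scheduler
            //   ...scheduler.fair.FairScheduler         - Fair Scheduler
            conf.set("yarn.resourcemanager.scheduler.class",
                "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
            return conf;
        }
    }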
1. FIFO Scheduler
As the name suggests FIFO i.e. First In First Out, so the tasks or application that comes first
will be served first. This is the default Scheduler we use in Hadoop. The tasks are placed in a
queue and the tasks are performed in their submission order. In this method, once the job is
scheduled, no intervention is allowed. So sometimes a high-priority job has to wait for a long time, since the priority of a task does not matter in this method.
Advantage:
● No need for configuration
● First come, first served
● Simple to execute
Disadvantage:
● The priority of a task doesn't matter, so high-priority jobs need to wait
● Not suitable for a shared cluster
2. Capacity Scheduler
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The Capacity Scheduler allows multiple tenants to share a large Hadoop cluster. For each job queue we provide some slots or cluster resources for performing job operations, and each job queue has its own slots to perform its tasks. If only one queue has tasks to perform, its tasks can also use the free slots of the other queues; when new tasks arrive in another queue, the slots belonging to that queue are handed back to its own jobs.
The Capacity Scheduler also provides a level of abstraction for knowing which tenant is utilizing more cluster resources or slots, so that a single user or application does not take a disproportionate or unnecessary number of slots in the cluster. The Capacity Scheduler mainly contains three types of queues, root, parent, and leaf, which are used to represent the cluster, an organization or subgroup, and the point of application submission, respectively.
Advantage:
● Best for working with Multiple clients or priority jobs in a Hadoop cluster
● Maximizes throughput in the Hadoop cluster
Disadvantage:
● More complex
● Not easy to configure for everyone
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler. The priority of the job is taken into consideration. With the help of the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and these resources are maintained dynamically, so there is no need for prior capacity planning. The resources are distributed in such a manner that all applications within a cluster get an equal share over time. The Fair Scheduler makes scheduling decisions on the basis of memory, but we can configure it to work with CPU as well.
As mentioned, it is similar to the Capacity Scheduler, but the major thing to notice is that in the Fair Scheduler, whenever a high-priority job arrives in the same queue, the job is processed in parallel by reallocating some portion of the already dedicated slots.
Advantages:
● Resources assigned to each application depend upon its priority.
● It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantages: Configuration is required.
Improvement of the Hadoop Job Scheduling Algorithm
Hadoop, an open-source framework for distributed storage and processing of large datasets,
relies on efficient job scheduling algorithms to manage the execution of tasks across a cluster
of nodes. As the scale and complexity of data processing workloads continue to increase,
continuous improvements in job scheduling are crucial for optimizing resource utilization,
minimizing job completion times, and enhancing overall system performance.
Fair Scheduler:
The Fair Scheduler was introduced as an improvement over the default Hadoop scheduler to
provide fairness in resource allocation among different users and jobs. It divides the cluster's
resources into pools, ensuring that each user or job gets a fair share of resources, preventing
the dominance of large jobs over smaller ones.
Weighted Fair Queuing:
Weighted Fair Queuing (WFQ) is an enhancement to the Fair Scheduler that introduces the
concept of weights to allocate resources. This allows users or jobs to be assigned different
weights, influencing their access to cluster resources accordingly. Jobs with higher weights
receive a larger share of resources.
Delay Scheduling:
Delay Scheduling aims to improve job completion times by delaying the assignment of tasks
to nodes until resources become available on a suitable node. This helps in achieving better
data locality and reduces the chances of stragglers, enhancing the overall efficiency of the job
execution.
Proportional-Share and Predictive Scheduling:
Proportional-share scheduling allocates cluster resources to jobs in proportion to their assigned shares or weights, in the same spirit as the weighted fair queuing described above. In addition, leveraging machine learning techniques to predict job resource requirements and execution times can significantly enhance scheduling decisions. Predictive analytics can help the scheduler make proactive adjustments, optimizing resource allocations for better overall performance.
Hybrid Scheduling Approaches:
Combining the strengths of multiple scheduling algorithms in a hybrid approach can offer the best of both worlds. Hybrid schedulers can adapt to different workload characteristics, providing flexibility and efficiency in resource allocation.
In summary, improvements to the Hadoop job scheduling algorithm center on ensuring fair and efficient allocation of cluster resources. Future developments are likely to
focus on adaptability to diverse workloads, integration with emerging technologies, and the
utilization of advanced analytics to further refine scheduling decisions in Hadoop
environments.
Hadoop, a cornerstone in big data processing, relies on robust job management frameworks
to orchestrate the execution of distributed tasks across clusters of nodes. As the volume and
complexity of data continue to grow, improvements in Hadoop job management frameworks
are essential for optimizing resource utilization, minimizing job completion times, and
ensuring seamless scalability. This essay explores various advancements and strategies
employed to enhance the efficiency of Hadoop job management frameworks.
YARN represents a paradigm shift in Hadoop's job management capabilities. It decouples the
resource management and job scheduling functions, enabling a more flexible and scalable
architecture. YARN allows multiple processing engines, such as MapReduce and Apache
Spark, to coexist and share resources on a Hadoop cluster, providing improved support for
diverse workloads.
Advanced job prioritization and scheduling policies have been introduced to better cater to
diverse user requirements. This includes fair scheduling policies, capacity-based scheduling,
and weighted fair queuing, allowing administrators to allocate resources based on user
priorities, job sizes, and other relevant factors.
Containerization technologies, such as Docker, have been integrated into Hadoop ecosystems
to streamline job management. Containers encapsulate application code and dependencies,
facilitating consistent deployment across different environments and reducing the overhead
associated with managing dependencies on individual nodes.
Improved job monitoring and visualization tools have been introduced to provide
administrators and users with real-time insights into job progress, resource utilization, and
potential bottlenecks. These tools enhance transparency and facilitate more informed
decision-making in managing Hadoop jobs.
Integration with workflow management tools, such as Apache Oozie, enables the seamless
coordination of complex data processing workflows. This ensures that multiple jobs can be
orchestrated and scheduled in a coherent manner, supporting end-to-end data processing
pipelines.
Enhancements in logging and auditing capabilities help in tracking job execution details,
identifying performance bottlenecks, and ensuring compliance with security and governance
requirements. These features contribute to a more transparent and accountable job
management framework.
The evolution of Hadoop job management frameworks includes improved support for
multi-tenancy, allowing multiple users or organizations to share a common cluster while
maintaining isolation and fair resource allocation.
Hadoop Distributed File System (HDFS) is a critical component of the Hadoop ecosystem,
providing scalable and reliable storage for large-scale data processing. To ensure optimal
performance in handling vast amounts of data across distributed clusters, several strategies
and techniques have been developed to optimize HDFS. This essay explores key performance
optimization strategies for HDFS, addressing aspects such as data storage, retrieval, and
overall system efficiency.
Optimizing the block size and replication factor in HDFS is crucial for balancing data storage
efficiency and fault tolerance. Larger block sizes can reduce metadata overhead, while an
appropriate replication factor ensures data availability in the face of node failures. Tuning
these parameters based on workload characteristics and cluster configuration is essential for
performance optimization.
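A small, hedged sketch of tuning these two parameters, both cluster-wide through configuration properties and per file through the FileSystem API, is shown below; the block sizes, replication factor and path are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);  // 256 MB blocks to cut metadata overhead
            conf.setInt("dfs.replication", 3);                  // three replicas for fault tolerance
            FileSystem fs = FileSystem.get(conf);

            // Per-file override: buffer size, replication factor and block size can also
            // be passed directly to create(); the values here are purely illustrative.
            try (FSDataOutputStream out = fs.create(
                    new Path("/data/events.log"), true, 4096, (short) 2, 128L * 1024 * 1024)) {
                out.writeBytes("sample record\n");
            }
        }
    }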
Efficient data node placement and rack awareness in HDFS contribute to improved data
locality, reducing network overhead during data access. By strategically placing data nodes
on different racks and considering network proximity, HDFS can optimize data retrieval by
minimizing inter-rack data transfers.
Leveraging Solid State Drives (SSDs) for storage in HDFS can significantly enhance read
and write performance. SSDs offer faster access times compared to traditional Hard Disk
Drives (HDDs), making them well-suited for scenarios where low-latency data access is
critical.
Employing memory caching mechanisms, such as HDFS caching and the use of technologies
like Apache Hadoop Distributed Cache, helps reduce the I/O overhead by caching frequently
accessed data in memory. This enhances the speed of data retrieval, especially for iterative
algorithms and commonly used datasets.
Maximizing parallelism during data reads and writes is essential for optimizing HDFS
performance. This involves concurrent execution of multiple read and write operations across
data nodes, leveraging the parallel processing capabilities of the underlying cluster
infrastructure.
Ensuring balanced Disk Input/Output (I/O) across data nodes helps prevent hotspots and
bottlenecks. Distributing data evenly across the cluster and avoiding imbalances in read and
write operations contribute to a more efficient utilization of storage resources.
Heterogeneous Storage Policies:
Implementing storage policies that consider the performance characteristics of different types
of storage devices allows for tiered storage within HDFS. This enables the placement of data
on storage media that best suits its access patterns, balancing cost and performance
requirements.
Compression Techniques:
Applying compression techniques to HDFS data can lead to significant space savings and
reduce the amount of data transferred over the network. However, it's essential to strike a
balance between compression ratios and the computational overhead of compression and
decompression during data processing.
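One common way to apply such compression from a MapReduce job, sketched below under assumed settings, is to compress the job output (and optionally the intermediate map output) so that the data lands on HDFS already compressed; the Gzip codec is chosen purely as an example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionExample {
        public static Job buildJob() throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "compressed output");
            // Compress the final job output written to HDFS
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
            // Optionally compress intermediate map output to reduce shuffle traffic
            job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
            return job;
        }
    }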
Regularly updating Hadoop and HDFS to the latest versions ensures that the cluster benefits
from performance improvements, bug fixes, and new features. Staying current with software
releases is crucial for maintaining a high level of efficiency and security.
Optimizing the performance of Hadoop Distributed File System is essential for the
efficient processing of large-scale data in distributed environments. The strategies mentioned,
ranging from block size and replication tuning to the use of SSDs and memory caching,
collectively contribute to a well-tuned and high-performance HDFS. As data volumes
continue to grow, ongoing research and development efforts will likely focus on further
advancements in storage technologies and optimization techniques to meet the evolving
demands of big data processing.
As HDFS separates metadata management from block management, clients have to follow a
complex protocol to read a file even if the file only has a few bytes of data. When reading a
file, a client first contacts the namenode to get the location of the data block(s) of the file. The
namenode returns the locations of the blocks to the client after checking that the client is
authorized to access the file. Upon receiving the locations of the data blocks the client
establishes communication channels with the datanodes that store the data blocks and reads
the data sequentially. If the client is located on the same datanode that stores the desired
block then the client can directly read the data from the local disk. This protocol is very
expensive for reading/writing small files where the time required to actually read/write the
small data block is significantly smaller than the time taken by the associated file system
metadata operations and data communication protocols.
The problem is even worse for writing small files, as the protocol for writing a file involves a
relatively very large number of file system operations for allocating inodes, blocks, and data
transfer. In order to write a file, the client first sends a request to the namenode to create a
new inode in the namespace. The namenode allocates a new inode for the file after ensuring
that the client is authorized to create the file. After successfully creating an inode for the new
file the client then sends another file system request to the namenode to allocate a new data
block for the file. The namenode then returns the address of three datanodes where the client
should write the data block (triple replication, by default). The client then establishes a data
transfer pipeline involving the three datanodes and starts sending the data to the datanodes.
The client sends the data sequentially to the first datanode in the data transfer pipeline, and
the first datanode then forwards the data to the second datanode, and so on. As soon as the
datanodes start to receive the data, they create a file on the local file system to store the data
and immediately send an RPC request to the namenode informing it about the allocation of
the new block. Once the data is fully written to the blocks, the datanodes send another RPC
request to the namenode about the successful completion of the block. The client can then
send a request to the namenode to allocate a new block or close the file. Clearly, this protocol
is only suitable for writing very large files where the time required to stream the data would
take much longer than the combined time of all the file system operations involved in the file
write protocol, that is, the cost of the metadata operations and establishing communication
channels with the datanodes is amortized over the relatively long periods of time spent in
reading/writing large files. In contrast, the latency of file system operations performed on
small files is dominated by the time spent on metadata operations, as reading/writing a small
file involves the client communicating with both the namenode and at least one datanode.
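From the client's point of view, this whole protocol is hidden behind the FileSystem API. The sketch below writes and then reads a small file; the namenode URI and path are assumptions, and the namenode/datanode interaction described above happens underneath these calls.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // assumed URI
            Path file = new Path("/user/demo/notes.txt");                             // assumed path

            // Write: the namenode allocates an inode and blocks, then the client
            // streams the data through the datanode replication pipeline.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // Read: the namenode returns block locations, then the client reads
            // the block data directly from a datanode.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }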
Row Key Design: Careful design of the row key is crucial for optimal
performance. The row key should reflect the access patterns of queries to
facilitate efficient retrieval. Sequential, short, and evenly distributed row keys
often lead to better performance.
Column Family and Qualifier Design: Rationalize the column family and
qualifier design based on query requirements. Avoid excessive use of column
families, and strive for a balance between the number of column qualifiers and
the amount of data stored within a single row.
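A hedged sketch of these data-modeling points using the standard HBase Java client is shown below; the table name, the short column family "e", the qualifiers and the composite row key format are illustrative assumptions, not recommendations from the notes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user_events"))) {  // assumed table
                // Short, evenly distributed composite row key: user id + event time
                byte[] rowKey = Bytes.toBytes("u0521#20240101T120000");
                Put put = new Put(rowKey);
                // One short column family ("e") with a small number of qualifiers
                put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type"), Bytes.toBytes("login"));
                put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("device"), Bytes.toBytes("atm-17"));
                table.put(put);
            }
        }
    }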
Memory Management:
Heap Configuration: Tune Java Virtual Machine (JVM) heap settings for
RegionServers to ensure that sufficient memory is allocated. Adequate memory
helps reduce disk I/O by caching frequently accessed data in memory.
Block Cache Usage: Optimize the use of HBase's block cache, a mechanism for
caching HFile blocks in memory. Configuring an appropriate block cache size
and understanding the access patterns of queries can significantly improve read
performance.
Other important areas of HBase tuning include compaction strategies, indexing techniques, hardware considerations, caching mechanisms, and compression techniques.
Performance optimization in HBase is a multifaceted task that requires careful
consideration of data modeling, configuration tuning, and architectural choices.
The strategies mentioned, from effective data modeling and memory
management to compaction and indexing techniques, collectively contribute to a
well-optimized HBase deployment. Continuous monitoring, experimentation,
and adaptation to changing workloads are crucial for maintaining optimal
performance as data volumes and usage patterns evolve. As the big data
landscape continues to advance, ongoing research and development efforts will
likely focus on further optimizations and innovations to meet the ever-growing
demands of real-time, distributed data processing.
HBase Framework
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns regions to Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server instances present in the cluster. In a distributed environment, the Master runs several background threads. HMaster has many features, such as controlling load balancing, failover, etc.
2. Region Server –
HBase tables are divided horizontally by row-key range into regions. Regions are the basic building elements of an HBase cluster; they hold a portion of a table's data and are comprised of column families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster. A Region Server is responsible for several things for its set of regions, such as handling, managing and executing HBase read and write operations on those regions. The default size of a region is 256 MB.
3. Zookeeper –
It acts as a coordinator in HBase. It provides services such as maintaining configuration information, naming, providing distributed synchronization, server failure notification, etc. Clients locate region servers via ZooKeeper.
Advantages of HBase – linear scalability, fast random read/write access, and tight integration with Hadoop (see the features listed later in this section).
Disadvantages of HBase –
● No transaction support
HBase and HDFS –
● HBase provides low-latency access, while HDFS provides high-latency operations.
● HBase supports random reads and writes, while HDFS follows a write-once, read-many model.
● HBase is accessed through shell commands, Java API, REST, Avro or Thrift API while
HDFS is accessed through MapReduce jobs.
Distributed and Scalable: HBase is designed to be distributed and scalable, which means it
can handle large datasets and can scale out horizontally by adding more nodes to the cluster.
Column-oriented Storage: HBase stores data in a column-oriented manner, which means
data is organized by columns rather than rows. This allows for efficient data retrieval and
aggregation.
Hadoop Integration: HBase is built on top of Hadoop, which means it can leverage
Hadoop’s distributed file system (HDFS) for storage and MapReduce for data processing.
Consistency and Replication: HBase provides strong consistency guarantees for read and
write operations, and supports replication of data across multiple nodes for fault tolerance.
Built-in Caching: HBase has a built-in caching mechanism that can cache frequently
accessed data in memory, which can improve query performance.
Compression: HBase supports compression of data, which can reduce storage requirements
and improve query performance.
Flexible Schema: HBase supports flexible schemas, which means the schema can be updated
on the fly without requiring a database schema migration.
Note – HBase is extensively used for online analytical operations; for example, in banking applications it can serve real-time data updates such as those behind ATM machines.
Storage Mechanism in HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table can have multiple column families and each column family can have any number of columns. Subsequent column values are stored contiguously on disk, and each cell value of the table has a timestamp. In short, in HBase:
● A table is a collection of rows.
● A row is a collection of column families.
● A column family is a collection of columns.
● A column is a collection of key-value pairs.
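To illustrate the cell and timestamp model described above, the following sketch reads back the latest version of a single cell and prints its timestamp; it reuses the hypothetical table, column family and row key from the earlier write example.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("user_events"))) {   // assumed table
                Get get = new Get(Bytes.toBytes("u0521#20240101T120000"));          // assumed row key
                get.addColumn(Bytes.toBytes("e"), Bytes.toBytes("type"));
                Result result = table.get(get);
                // Each cell carries its value plus the timestamp of that version
                Cell cell = result.getColumnLatestCell(Bytes.toBytes("e"), Bytes.toBytes("type"));
                if (cell != null) {
                    System.out.println(Bytes.toString(CellUtil.cloneValue(cell))
                            + " @ " + cell.getTimestamp());
                }
            }
        }
    }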
HBase and RDBMS

HBase: It is schema-less; there is no concept of fixed columns, and the schema defines only column families.
RDBMS: It is governed by its schema, which describes the whole structure of the tables.

HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin, built for small tables, and hard to scale.

HBase: No transactions are there in HBase.
RDBMS: RDBMS is transactional.

HBase: It has de-normalized data.
RDBMS: It will have normalized data.

HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.
Features of HBase
● HBase is linearly scalable.
● It has automatic failure support.
● It provides consistent reads and writes.
● It integrates with Hadoop, both as a source and a
destination.
● It has an easy Java API for clients.
● It provides data replication across clusters.
Where to Use HBase
● Apache HBase is used to have random, real-time read/write
access to Big Data.
● It hosts very large tables on top of clusters of commodity
hardware.
● Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
● It is used whenever there is a need for write-heavy applications.
● HBase is used whenever we need to provide fast random
access to available data.
● Companies such as Facebook, Twitter, Yahoo, and Adobe use
HBase internally.
HBase History
Year Event
Nov 2006 Google released the paper on BigTable.
Feb 2007 Initial HBase prototype was created as a Hadoop contribution.
Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.
Jan 2008 HBase became the sub project of Hadoop.
Oct 2008 HBase 0.18.1 was released.
Jan 2009 HBase 0.19.0 was released.
Sept 2009 HBase 0.20.0 was released.
May 2010 HBase became Apache top-level project.