Big Data Unit 2

UNIT – 2

Understanding Hadoop Ecosystem.

Hadoop Ecosystem

Hadoop is an open source framework based on Java that manages the storage and
processing of large amounts of data for applications. Hadoop uses distributed storage and
parallel processing to handle big data and analytics jobs, breaking workloads down into
smaller workloads that can be run at the same time. It has various components such as
HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its
components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume,
Ambari, ZooKeeper and Apache Oozie.

HDFS: Hadoop Distributed File System


YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database

Mahout, Spark MLlib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
ZooKeeper: Managing the cluster
Oozie: Job Scheduling

The four modules (HDFS, YARN, MapReduce, and Hadoop Common) comprise the primary
Hadoop framework and work collectively to form the Hadoop ecosystem:

Hadoop Distributed File System (HDFS): As the primary component of the Hadoop
ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate
on data that resides in their local storage. This reduces network latency, providing
high-throughput access to application data. In addition, administrators don’t need to
define schemas up front. It is the primary data storage system used by Hadoop
applications. HDFS is a Java-based system that allows large data sets to be stored
across nodes in a cluster in a fault-tolerant manner. It provides scalable, fault-tolerant,
reliable and cost-efficient data storage for big data. HDFS is a distributed file system that
runs on commodity hardware. HDFS ships with a default configuration that suits many
installations, though large clusters usually need additional configuration. Users
interact directly with HDFS through shell-like commands.
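
For instance, a few common HDFS shell commands look like the following (the local file
and HDFS paths shown here are only illustrative):

$ hdfs dfs -mkdir -p /user/hadoop/input
$ hdfs dfs -put localfile.txt /user/hadoop/input
$ hdfs dfs -ls /user/hadoop/input
$ hdfs dfs -cat /user/hadoop/input/localfile.txt
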
 It is responsible for storing large data sets of structured or unstructured data across
various nodes and thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components:
1. NameNode
2. DataNode
NameNode is also known as the master node. The NameNode does not store the actual
data or dataset. It stores metadata, i.e., the number of blocks, their locations, on which
rack and on which DataNode the data is stored, and other details. The metadata consists
of files and directories.

Functions/ Tasks of HDFS NameNode


 It executes the file system namespace operations like opening, renaming, and
closing files and directories.
 NameNode manages and maintains the DataNodes.
 It determines the mapping of blocks of a file to DataNodes.
 NameNode records each change made to the file system namespace.
 It keeps the locations of each block of a file.
 NameNode takes care of the replication factor of all the blocks.
 NameNode receives heartbeats and block reports from all DataNodes, which confirm
that the DataNodes are alive.
 If the DataNode fails, the NameNode chooses new DataNodes for new replicas.

 The NameNode is the prime node; it contains metadata (data about data) and requires
comparatively fewer resources than the DataNodes, which store the actual data. These
DataNodes are commodity hardware (affordable, easily replaceable computer hardware
that is typically built from off-the-shelf components) in the distributed environment,
which undoubtedly makes Hadoop cost-effective.

Functions/Tasks of DataNode
 DataNode is responsible for serving the client read/write requests.
 Based on the instructions from the NameNode, DataNodes perform block creation,
replication, and deletion.
 DataNodes send a heartbeat to the NameNode to report the health of HDFS.
 DataNodes also send block reports to the NameNode to report the list of blocks they
contain.
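
As a rough illustration of how a client's read/write requests reach the NameNode and
DataNodes through the HDFS Java API, the following minimal sketch writes a small file
and reads it back. The cluster address and file path are assumptions for this sketch, not
values from this unit; in a real deployment fs.defaultFS comes from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally taken from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/demo.txt"); // hypothetical path

        // Write: the NameNode chooses DataNodes and the client streams blocks to them
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the NameNode returns block locations and the client reads from DataNodes
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
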

FsImage is a file stored on the OS filesystem that contains the complete directory
structure (namespace) of HDFS, with details about the location of the data blocks and
which blocks are stored on which node. This file is used by the NameNode when it is
started.

EditLogs is a transaction log that records the changes in the HDFS file system or any
action performed on the HDFS cluster such as addition of a new block, replication,
deletion etc. In short, it records the changes since the last FsImage was created.

The Checkpoint node periodically creates checkpoints of the namespace. It downloads
the fsimage and edits files from the active NameNode, merges them locally, and uploads
the new image back to the active NameNode.

The Secondary NameNode is a dedicated node in the HDFS cluster that takes checkpoints
of the file system metadata on the NameNode. It is a helper node that assists the primary
NameNode by merging the fsimage and edit logs.

 Backup node – keeps an up-to-date, in-memory copy of the file system namespace and
is used for backup of the namespace.


 Job tracker node – JobTracker is a service that receives client requests and assigns them
to TaskTrackers. It is a component of the MapReduce framework that distributes tasks
across available nodes in a cluster.
 Task tracker node – A TaskTracker is a node in a Hadoop cluster that receives tasks from
a JobTracker. The tasks include Map, Reduce, and Shuffle operations.

 HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.

Yet Another Resource Negotiator (YARN): YARN is a resource-management platform
responsible for managing compute resources in clusters and using them to schedule users’
applications. It performs scheduling and resource allocation across the Hadoop system.
YARN sits between HDFS and the processing engines deployed by users.

YARN is made up of 3 main pieces:

 Resource Manager
 Node Manager
 Application Master

Resource Manager
ResourceManager (RM) is the master that arbitrates all the available cluster resources and
thus helps manage the distributed applications running on the YARN system. It works
together with the per-node NodeManagers (NMs) and the per-application
ApplicationMasters (AMs).

1. NodeManagers take instructions from the ResourceManager and manage resources
available on a single node.
2. ApplicationMasters are responsible for negotiating resources with the
ResourceManager and for working with the NodeManagers to start the containers.

Node Manager
The Node Manager (NM) is YARN’s per-node agent and takes care of the individual
compute nodes in a Hadoop cluster. This includes keeping up to date with the Resource
Manager (RM), overseeing containers’ life-cycle management, monitoring the resource
usage (memory, CPU) of individual containers, tracking node health, managing logs, and
running auxiliary services which may be exploited by different YARN applications.

Application Master
The Application Master is, in effect, an instance of a framework-specific library and is
responsible for negotiating resources from the Resource Manager and working with the
Node Manager(s) to execute and monitor the containers and their resource consumption.
It has the responsibility of negotiating appropriate resource containers from the Resource
Manager, tracking their status and monitoring progress.
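
As a small illustration of how a client talks to the ResourceManager, the sketch below
uses the YarnClient API to list running nodes and applications. It only shows where the
RM fits; a real application framework would go on to request containers through an
ApplicationMaster. The configuration is assumed to come from yarn-site.xml on the
classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfoSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarnClient.start();

        // The ResourceManager tracks the state of every NodeManager in the cluster
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }

        // It also tracks every application, each of which has its own ApplicationMaster
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " state=" + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}
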

MapReduce: MapReduce is a programming model for large-scale data processing.
MapReduce processes a large volume of data in parallel by dividing a task into
independent sub-tasks. MapReduce works by breaking the processing into two tasks:
the map task and the reduce task.

Map takes a set of data and converts it into another set of data, where individual elements
are broken down into key-value pairs. The reduce task then receives the output from a
map as its input, which is in key-value pairs, and combines that data into a smaller set of
key-value pairs.

By default, the input type in MapReduce is text. MapReduce programs can be written in
various languages such as Java, Ruby, Python, and C++.

The MapReduce architecture

MapReduce architecture has the following two daemon processes:

1. JobTracker: Master process
2. TaskTracker: Slave process
JobTracker: JobTracker is the master process and is responsible for coordinating and
completing a MapReduce job in Hadoop. The main functionality of JobTracker is
resource management, tracking resource availability, and keeping track of our requests.

TaskTracker: TaskTracker is the slave process to the JobTracker. A TaskTracker sends
heartbeat messages to the JobTracker every 3 seconds to inform it about free slots, sends
the status of its tasks, and checks whether any new task has to be performed.

The MapReduce phases

A MapReduce job involves multiple steps and processes or tasks; some steps are
performed by Hadoop with default behavior and can be overridden if needed.

The MapReduce program is executed in three main phases: mapping phase, shuffling and
sorting phase, and reducing phase. There is also an optional phase known as the combiner
phase.

 Mapping phase
This is the first phase of the program. There are two steps in this phase: splitting and
mapping. The input file is divided into smaller, equal chunks called input splits for
efficiency. Since the Mappers understand only (key, value) pairs, Hadoop uses a
RecordReader (with TextInputFormat) to transform input splits into key-value pairs.

An input split is a logical partition of the data, basically used during data processing in
the MapReduce program or other processing techniques. Each logical partition of data is
processed by one Mapper.

In MapReduce, parallelism will be achieved by Mapper. For each Input split, a new
instance of the mapper is instantiated. The mapping step contains a coding logic that is
applied to these data blocks. In this step, the mapper processes the key-value pairs and
produces an output of the same form (key-value pairs).

 Shuffle and sorting phase

Shuffle and sort are intermediate steps in MapReduce between the Mapper and the
Reducer; they are handled by Hadoop and can be overridden if required. The shuffle
process aggregates all the Mapper output by grouping the keys of the Mapper output, and
the values are appended into a list of values. So the shuffle output format will be a map
<key, List<values>>. The keys from the Mapper output will be consolidated and sorted.

Sorting and shuffling are responsible for creating a unique key and a list of values.
Bringing similar keys to one location is known as sorting, and the process by which the
intermediate output of the mapper is sorted and sent across to the reducers is known as
shuffling.

 Reducer phase
The output of the shuffle and sorting phase is used as the input to the Reducer phase, and
the Reducer processes the list of values. Each key could be sent to a different Reducer.
The Reducer sets the output value, which is consolidated into the final output of the
MapReduce job and saved in HDFS.

Suppose the text file which we are using is called test.txt and it contains the following
data:

Data Hadoop Python

Hadoop Hadoop Java

Python Data Apache

The output which we expect should look like this:

Apache – 1
Data – 2
Hadoop – 3
Java – 1
Python – 2

MAPPING PHASE – Input Split + Mapping
REDUCING PHASE – Shuffling + Reducer

Suppose a user runs a query (count the number of occurrences of all the unique words)
on our test.txt file. To keep track of the request we use the JobTracker (a master service);
the JobTracker traps our request and keeps track of it.

First, the records are divided into smaller chunks for efficiency, in our case the input is
divided into 3 chunks which are called input splits. Since there are three input splits, three
different mappers will be used for each split.

Each Mapper then parses its lines, gets each word, and emits <word, 1> for every word.
In this example, the output of the Mapper for the line Hadoop Hadoop Java will be
<Hadoop, 1>, <Hadoop, 1>, and <Java, 1>.

All Mappers emit the word as the key and the hardcoded value 1 as the value. The reason
for giving the hardcoded value 1 and not any other value is that every word, in itself,
occurs once.

In the next phase (shuffle and sorting) the key-value pairs output from the Mappers having
the same key will be consolidated. So the keys ‘Hadoop’, ‘Data’, ‘Java’, and the others will
be consolidated, and the values will be appended as a list, in this case <Java, List<1>>,
<Python, List<1, 1>>, <Hadoop, List<1, 1, 1>> and so on.

The key produced by Mappers will be compared and sorted. The key and list of values
will be sent to the next step in the sorted sequence of the key.

Next, the reducer phase will get <key, List<>> as input, and will just count the number of
1s in the list and will set the count value as output. For example, the output for certain
keys is as follows:

<Java, List<1>> will be <Java, 1>

<Python, List<1, 1>> will be <Python, 2>

<Hadoop, List<1, 1, 1>> will be <Hadoop, 3>

In the end, all the output key/value pairs from the Reducer phase will be consolidated to a
file and will be saved in HDFS as the final output.
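
The word-count job walked through above corresponds closely to the standard Hadoop
WordCount program. A sketch of it in Java is shown below (the input and output paths
passed on the command line are illustrative). The Mapper emits <word, 1> pairs and the
Reducer sums the list of 1s for each word; the combiner is the optional phase mentioned
earlier.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for every word in the line, emit <word, 1>
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: receives <word, List<1, 1, ...>> after shuffle/sort and sums the 1s
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional combiner phase
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/test.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
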

Hadoop MapReduce Applications

MapReduce is used in many applications; let us have a look at some of them.

 Entertainment :
A lot of web series and movies are released on various OTT platforms such as Netflix.
You might have come across a situation where you can’t decide which movie to watch
and you take a look at the suggestions provided by Netflix and then you try one of the
suggested series or movies.

Hadoop and MapReduce are used by Netflix to recommend popular movies to users
based on what they have watched and which movies they liked.

MapReduce can determine how users are watching movies by analyzing their logs and
clicks.

 E-commerce:
Many E-commerce companies such as Flipkart, Amazon, and eBay use MapReduce to
analyze the buying behavior of the customers based on customers’ interests or their
buying behavior. It analyzes records, purchase history, user interaction logs, etc., and
provides product recommendation mechanisms for various e-commerce companies.

Here are some techniques for optimizing MapReduce jobs:

 Cluster configuration: Ensure your cluster is properly configured.


 LZO compression: Use LZO compression for intermediate data. LZO can reduce the
amount of disk I/O, but it does add some overhead to the CPU.
 Combiner: Write a Combiner between the Mapper and Reducer. This can reduce
shuffling and optimize the MapReduce job.
 Writables: Use the most appropriate and compact writable type for data. Also, reuse
Writables.
 Input split size: Increase the input split size to reduce the number of MapReduce tasks.
 Caching: Use caching to store intermediate results in memory and reduce disk I/O.
 Block size: For large files, try keeping the block size at 256 MB or 512 MB.
 Setup and cleanup tasks: Optimize the setup and cleanup tasks to reduce the time cost
during the initialization and termination stages of a job.
 Task assignment mechanism: Replace the pull-model task assignment mechanism with a
push-model.
 Communication mechanism: Replace the heartbeat-based communication mechanism.
 In the Hadoop framework, a heartbeat is a signal sent by a DataNode to the
NameNode. The signal indicates that the DataNode is alive and functioning
properly. The NameNode interprets the signal as a sign of vitality.
 The heartbeat interval is 3 seconds by default. It is configured in the
property dfs.heartbeat.interval.
 The heartbeat mechanism ensures that the NameNode is aware of the status of
DataNodes. If the NameNode doesn't know that a DataNode is down, it can't
continue processing using replicas.

 If there is no response to the signal, it is understood that there are technical
problems with the DataNode.

Advantages of MapReduce

One of the main advantages of MapReduce is its scalability and fault-tolerance.
MapReduce can handle petabytes of data by splitting it into smaller chunks and
assigning them to multiple nodes in a cluster.

 Large log file processing
MapReduce can be used to process and analyze large log files from web servers,
application servers, and other systems. It can be used to identify trends, detect
anomalies, and monitor system performance.
 Large-scale graph analysis
MapReduce is suitable for large-scale graph analysis. It was originally developed for
determining PageRank of web documents.
 Machine learning
MapReduce can be used in machine learning.
 Social media data analysis
MapReduce can be used to analyze social media data. For example, it can be used to
analyze millions of tweets to find the most common hashtags.

MapReduce has a number of features, including:

 Scalability: MapReduce can handle petabytes of data by splitting it into smaller chunks
and assigning them to multiple nodes in a cluster.
 Fault-tolerance: If a node fails, its tasks are automatically re-executed on another node,
so a job can still complete despite individual failures.
 Parallel processing-compatible: MapReduce is suitable for iterative computation
involving large quantities of data requiring parallel processing.
 Versatile: Businesses can use MapReduce programming to access new data sources.
 Fast: By processing data in parallel, close to where it is stored, MapReduce can
complete jobs quickly.
 Based on a simple programming model: Developers only need to supply the map and
reduce functions.

Hadoop Common: Hadoop Common includes the libraries and utilities used and shared
by other Hadoop modules.
Hadoop consists of the Hadoop Common package, which provides file system and
operating system level abstractions, a MapReduce engine (either MapReduce/MR1 or
YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common
package contains the Java Archive (JAR) files and scripts needed to start Hadoop.

Hadoop archive (HAR) is a facility which packs small files into one compact HDFS block
to avoid memory wastage on the NameNode. The NameNode stores the metadata
information of the HDFS data, so if a 1 GB file is broken into 1000 small pieces, the
NameNode has to store metadata about all those 1000 small files. In that manner,
NameNode memory is wasted in storing and managing a lot of metadata.

A HAR is created from a collection of files, and the archiving tool runs a MapReduce job.
These MapReduce jobs process the input files in parallel to create an archive file.

HAR Syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
If you have a hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this
archive as MapReduce input, all you need to do is specify the input directory as
har:///user/zoo/foo.har.

If we list the archive file:

$hadoop fs -ls /data/myArch.har

/data/myArch.har/_index
/data/myArch.har/_masterindex
/data/myArch.har/part-0
The part files are the original files concatenated together into big files, and the index files
are used to look up the small files inside the big part files.

Limitations of HAR Files:

1) Creation of HAR files creates a copy of the original files, so we need as much disk
space as the size of the original files which we are archiving. We can delete the original
files after creation of the archive to release some disk space.
2) Once an archive is created, to add or remove files from the archive we need to re-create
the archive.
3) Using a HAR file as MapReduce input still requires lots of map tasks, which is
inefficient.

Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosystem
continues to grow and includes many tools and applications to help collect, store, process,
analyze, and manage big data. These include Apache Pig, Apache Hive, Apache
HBase,Apache Spark, Presto, and Apache Zeppelin.
Hadoop is often used by companies who need to handle and store big data.

Zookeeper

ZooKeeper is a lightweight tool that supports high availability and redundancy. A
Standby NameNode maintains an active session with the ZooKeeper daemon.

If an Active NameNode falters, the Zookeeper daemon detects the failure and carries out
the failover process to a new NameNode. Use Zookeeper to automate failovers and
minimize the impact a NameNode failure can have on the cluster. Apache Zookeeper is
an open-source project that provides a centralized service for distributed systems. It
provides configuration information, naming, synchronization, and group services over
large clusters.

Zookeeper is a centralized repository where distributed applications can put data and get
data out of it. It is used to keep the distributed system functioning together as a single
unit.
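
A minimal sketch of a ZooKeeper client in Java is shown below, assuming an ensemble
reachable at localhost:2181; the znode path and the stored value are placeholders. It
shows how a distributed application can put a small piece of shared configuration into
ZooKeeper and read it back from any node.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (address and session timeout are assumptions)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        String path = "/demo-config"; // hypothetical znode
        if (zk.exists(path, false) == null) {
            // Store a small piece of shared configuration
            zk.create(path, "replication=3".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the distributed system can now read the same value
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
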

 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow and
Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a
sequentially ordered manner, whereas Oozie coordinator jobs are those that are
triggered when some data or an external stimulus is given to them.

There are two basic types of Oozie jobs:

 Oozie Workflow jobs are Directed Acyclic Graphs (DAGs), specifying a sequence of
actions to execute. The Workflow job has to wait.
 Oozie Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by
time and data availability.
Oozie Bundle provides a way to package multiple coordinator and workflow jobs and to
manage the lifecycle of those jobs.

How Oozie Works


An Oozie Workflow is a collection of actions arranged in a Directed Acyclic Graph
(DAG) . Control nodes define job chronology, setting rules for beginning and ending a
workflow. In this way, Oozie controls the workflow execution path with decision, fork
and join nodes. Action nodes trigger the execution of tasks.
Oozie triggers workflow actions, but Hadoop MapReduce executes them. This allows
Oozie to leverage other capabilities within the Hadoop stack to balance loads and handle
failures.
Oozie detects completion of tasks through callback and polling. When Oozie starts a task,
it provides a unique callback HTTP URL to the task, and the task notifies that URL when
it is complete. If the task fails to invoke the callback URL, Oozie can poll the task for
completion.
Often it is necessary to run Oozie workflows on regular time intervals, but in
coordination with unpredictable levels of data availability or events. In these
circumstances, Oozie Coordinator allows you to model workflow execution triggers in
the form of the data, time or event predicates. The workflow job is started after those
predicates are satisfied.
Oozie Coordinator can also manage multiple workflows that are dependent on the
outcome of subsequent workflows. The outputs of subsequent workflows become the
input to the next workflow. This chain is called a “data application pipeline”.
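
As a rough illustration of control and action nodes (the workflow name, action name, and
the ${...} parameters below are placeholders, not values from this unit), a minimal Oozie
workflow.xml with one MapReduce action might look like this:

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="wordcount-node"/>
    <action name="wordcount-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Word count failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Here start, end, and kill are control nodes that define the execution path, while the
map-reduce element is an action node whose task is actually executed by Hadoop.
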

HBase:

What is HBase?

HBase is a distributed, column-oriented database built on top of the Hadoop file system.
It's an open-source project that's horizontally scalable. HBase is a column-oriented
non-relational database management system that runs on top of the Hadoop Distributed
File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets,
which are common in many big data use cases. It is well suited for real-time data
processing or random read/write access to large volumes of data.
Unlike relational database systems, HBase does not support a structured query language
like SQL; in fact, HBase isn’t a relational data store at all. HBase applications are written
in Java™ much like a typical Apache MapReduce application. HBase does support
writing applications in Apache Avro, REST and Thrift.

An HBase system is designed to scale linearly. It comprises a set of standard tables with
rows and columns, much like a traditional database. Each table must have an element
defined as a primary key, and all access attempts to HBase tables must use this primary
key.

Avro, as a component, supports a rich set of primitive data types including: numeric,
binary data and strings; and a number of complex types including arrays, maps,
enumerations and records. A sort order can also be defined for the data.

HBase relies on ZooKeeper for high-performance coordination. ZooKeeper is built into
HBase, but if you’re running a production cluster, it’s suggested that you have a dedicated
ZooKeeper cluster that’s integrated with your HBase cluster.
HBase works well with Hive, a query engine for batch processing of big data, to enable
fault-tolerant big data applications.

An example of HBase

An HBase column represents an attribute of an object; if the table is storing diagnostic
logs from servers in your environment, each row might be a log record, and a typical
column could be the timestamp of when the log record was written, or the server name
where the record originated.

HBase allows for many attributes to be grouped together into column families, such that
the elements of a column family are all stored together. This is different from a row-
oriented relational database, where all the columns of a given row are stored together.
With HBase you must predefine the table schema and specify the column families.
However, new columns can be added to families at any time, making the schema flexible
and able to adapt to changing application requirements.
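
Continuing the server-log example, a brief sketch using the HBase Java client API shows
how a row keyed by a log record id could be written and read back. The table name,
column family, row key, and values are hypothetical; the table and its column family
would have to be created first (for example from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("server_logs"))) {

            // Write one log record; "info" is an assumed column family
            Put put = new Put(Bytes.toBytes("log-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("timestamp"),
                          Bytes.toBytes("2024-01-01T10:15:30"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("server"),
                          Bytes.toBytes("web-01"));
            table.put(put);

            // Random read of the same row by its row key
            Result result = table.get(new Get(Bytes.toBytes("log-0001")));
            String server = Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("server")));
            System.out.println("server = " + server);
        }
    }
}
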
Just as HDFS has a NameNode and slave nodes, and MapReduce has a JobTracker and
TaskTracker slaves, HBase is built on similar concepts. In HBase a master node manages
the cluster, and region servers store portions of the tables and perform the work on the
data. In the same way that HDFS has some enterprise concerns due to the availability of
the NameNode, HBase is also sensitive to the loss of its master node.

Limitations of Hadoop

Hadoop can perform only batch processing, and data will be accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.

A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data
in a single unit of time (random access).

Hadoop Random Access Databases

Applications such as HBase, Cassandra, couchDB, Dynamo, and MongoDB are some of
the databases that store huge amounts of data and access the data in a random manner.

What is HBase?

HBase is a distributed column-oriented database built on top of the Hadoop file system. It
is an open-source project and is horizontally scalable.

HBase is a data model that is similar to Google’s big table designed to provide quick
random access to huge amounts of structured data. It leverages the fault tolerance
provided by the Hadoop File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.

Comparison of HBase and HDFS

HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.

HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.

HDFS: It provides high latency batch processing; no concept of batch processing.
HBase: It provides low latency access to single rows from billions of records (random access).

HDFS: It provides only sequential access of data.
HBase: HBase internally uses hash tables and provides random access, and it stores the
data in indexed HDFS files for faster lookup.

Storage Mechanism in HBase


HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key-value pairs. A table can have
multiple column families and each column family can have any number of columns.
Subsequent column values are stored contiguously on the disk. Each cell value of the
table has a timestamp. In short, in an HBase:
 Table is a collection of rows.
 Row is a collection of column families.
 Column family is a collection of columns.
 Column is a collection of key value pairs.

Given below is an example schema of a table in HBase: each row has a Rowid and four
column families, and each column family contains the columns col1, col2, and col3.

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data,
rather than as rows of data. Shortly, they will have column families.

Row-Oriented Database: It is suitable for Online Transaction Processing (OLTP). Such
databases are designed for a small number of rows and columns.

Column-Oriented Database: It is suitable for Online Analytical Processing (OLAP).
Column-oriented databases are designed for huge tables.

“OLAP is optimized for complex data analysis and reporting, while OLTP is optimized
for transactional processing and real-time updates.”

“Performance for an OLAP system differs and can range from minutes to hours,
depending on the query complexity and volume of data. For an OLTP, processing occurs
in real-time and can be as fast as milliseconds”

Reading and writing data is much more efficient in a columnar (OLAP) database than in
a row-oriented (OLTP) one.

The following image shows column families in a column-oriented database:

Features of HBase
 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.
Where to Use HBase
 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable.
Bigtable acts on top of the Google File System; likewise, Apache HBase works on top
of Hadoop and HDFS.
Applications of HBase
 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

Hive
What is Hive in Hadoop?

The Apache Hive™ data warehouse software facilitates reading, writing, and
managing large datasets residing in distributed storage using SQL. Hive is an
open-source system that processes structured data in Hadoop. Apache Hive is built
on top of Apache Hadoop for providing data query and analysis. Hive gives an
SQL-like interface to query data stored in various databases and file systems that
integrate with Hadoop. Hive itself is implemented in Java.

Architecture of Hive

Hive chiefly consists of three core parts:

 Hive Clients: Hive offers a variety of drivers designed for communication with
different applications. For example, Hive provides Thrift clients for Thrift-based
applications. These clients and drivers then communicate with the Hive server, which
falls under Hive services.
 Hive Services: Hive services perform client interactions with Hive. For example, if a
client wants to perform a query, it must talk with Hive services.
 Hive Services: Hive services perform client interactions with Hive. For example, if a
client wants to perform a query, it must talk with Hive services.
 Hive Storage and Computing: Hive services such as the file system, job client, and
metastore then communicate with Hive storage and store things like metadata table
information and query results.

Hive's Features

These are Hive's chief characteristics:

 Hive is designed for querying and managing only structured data stored in tables
 Hive is scalable, fast, and uses familiar concepts
 Schema gets stored in a database, while processed data goes into a Hadoop Distributed
File System (HDFS)
 Tables and databases get created first; then data gets loaded into the proper tables
 Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record Columnar
File), and TEXTFILE
 Hive uses an SQL-inspired language, sparing the user from dealing with the
complexity of MapReduce programming. It makes learning more accessible by
utilizing familiar concepts found in relational databases, such as columns, tables, rows,
and schema, etc.
 The most significant difference between the Hive Query Language (HQL) and SQL is
that Hive executes queries on Hadoop's infrastructure instead of on a traditional
database
 Since Hadoop's programming works on flat files, Hive uses directory structures to
"partition" data, improving performance on specific queries

 Hive supports partition and buckets for fast and simple data retrieval
 Hive supports custom user-defined functions (UDF) for tasks like data cleansing and
filtering. Hive UDFs can be defined according to programmers' requirements

Limitations of Hive

Of course, no resource is perfect, and Hive has some limitations. They are:

 Hive doesn’t support OLTP. Hive supports Online Analytical Processing (OLAP), but
not Online Transaction Processing (OLTP).
 It doesn’t support subqueries.
 It has a high latency.
 Hive tables don’t support delete or update operations.

How Data Flows in the Hive?

1. The data analyst executes a query with the User Interface (UI).
2. The driver interacts with the query compiler to retrieve the plan, which consists of the
query execution process and metadata information. The driver also parses the query to
check syntax and requirements.
3. The compiler creates the job plan (metadata) to be executed and communicates with
the metastore to retrieve a metadata request.
4. The metastore sends the metadata information back to the compiler.
5. The compiler relays the proposed query execution plan to the driver.
6. The driver sends the execution plans to the execution engine.
7. The execution engine (EE) processes the query by acting as a bridge between the Hive
and Hadoop. The job process executes in MapReduce. The execution engine sends the
job to the JobTracker, found in the Name node, and assigns it to the TaskTracker, in
the Data node. While this is happening, the execution engine executes metadata
operations with the metastore.
8. The results are retrieved from the data nodes.
9. The results are sent to the execution engine, which, in turn, sends the results back to
the driver and the front end (UI).
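
For example, a client can submit HiveQL through the HiveServer2 JDBC driver, and the
query then flows through the driver, compiler, metastore, and execution engine as
described above. The sketch below assumes a HiveServer2 instance on localhost:10000
and a hypothetical web_logs table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // A simple aggregation; "web_logs" is a hypothetical table
            ResultSet rs = stmt.executeQuery(
                    "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}
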
Since we have gone on at length about what Hive is, we should also touch on what Hive
is not:

 Hive isn't a language for row-level updates and real-time queries


 Hive isn't a relational database
 Hive isn't a design for Online Transaction Processing

Hive Modes

Depending on the size of Hadoop data nodes, Hive can operate in two different modes:

 Local mode
 Map-reduce mode
Use Local mode when:

 Hadoop is installed under the pseudo mode, possessing only one data node
 The data size is smaller and limited to a single local machine
 Users expect faster processing because the local machine contains smaller datasets.
Use Map Reduce mode when:

 Hadoop has multiple data nodes, and the data is distributed across these different
nodes
 Users must deal with more massive data sets
MapReduce is Hive's default mode.

Hive: Hive is a platform used to create SQL-type scripts for MapReduce functions.

Maintains a data warehouse

Varied schema

Dense tables

Supports automatic partitioning

Supports both normalized and denormalized data

Uses HQL (Hive Query Language)

Pig

 Apache Pig is a high-level procedural language platform for creating programs that
run on Apache Hadoop. The language used for Pig is Pig Latin. Since Pig Latin is
quite similar to SQL, it is comparatively easy to learn Apache Pig if we have a little
knowledge of SQL. After the introduction of Pig Latin, programmers are able to
work on MapReduce tasks without writing complicated code as in Java.

Features of Pig Hadoop

There are several features of Apache Pig:

1. In-built operators: Apache Pig provides a very good set of operators for performing
several data operations like sort, join, filter, etc.
2. Ease of programming: Since Pig Latin has similarities with SQL, it is very easy to
write a Pig script.
3. Automatic optimization: The tasks in Apache Pig are automatically optimized. This
makes the programmers concentrate only on the semantics of the language.
4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured
data and store the results in HDFS.

Apache Pig Architecture

The main reason why programmers have started using Hadoop Pig is that it converts the
scripts into a series of MapReduce tasks making their job easy. Below is the architecture
of Pig Hadoop:

Pig Hadoop framework has four main components:

1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along with
other miscellaneous checks. Parser gives an output in the form of a Directed Acyclic
Graph (DAG) that contains Pig Latin statements, together with other logical
operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is
passed to a logical optimizer. The optimizer is responsible for carrying out the logical
optimizations.

3. Compiler: The role of the compiler comes in when the output from the optimizer is
received. The compiler compiles the logical plan sent by the optimizer. The logical
plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs
are sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop
for yielding the desired result.

Features Hive  Apache Pig

Users  Data analysts favor  Programmers and


Apache Hive researchers prefer
Apache Pig

27
 Hive uses a declarative  Pig uses a unique
language variant of procedural language
Language Used SQL called HQL called Pig Latin

 Hive works with  Pig works with both


structured data structured and semi-
Data Handling structured data

 Hive operates on the  Pig operates on the


cluster's server-side cluster's client-side
Cluster Operation

 Hive supports  Pig doesn't support


partitioning partitioning
Partitioning

 Hive doesn't load  Pig loads quickly


quickly, but it executes
Load Speed faster

Flume
Apache Flume is a tool for collecting, aggregating, and transporting large amounts of
streaming data. It's commonly used in big data environments to ingest log files, social
media data, clickstreams, and other high-volume data sources.

Flume is a special-purpose tool for sending data into HDFS. It's a critical component for
building end-to-end streaming workloads, with typical use cases including:

 Fraud detection
 Internet of Things applications
 Aggregation of sensor and machine data
 Alerting/SIEM

Here's how Flume works:

1. The Flume source picks up log files from data-generating sources like web servers and
Twitter and sends them to the channel.
2. The Flume sink component ensures that the data it receives is synced to the destination.
3. Flume moves these files to the Hadoop Distributed File System (HDFS) for further
processing.
4. Flume can also write to other storage solutions like HBase or Solr.
Flume is robust and fault tolerant with tunable reliability mechanisms and many failover
and recovery mechanisms.
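
As a small hedged example of this source → channel → sink pipeline, a Flume agent
configuration along the following lines tails a (hypothetical) web server log and syncs
the events to HDFS. The names a1, r1, c1, k1 and the paths are placeholders:

# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log file (path is illustrative)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/webserver/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write the events into HDFS for further processing
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/weblogs
a1.sinks.k1.channel = c1
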

Sqoop:

Sqoop is a tool used to perform data transfer operations between relational database
management systems and the Hadoop server. Thus it helps in transferring bulk data from
one point of source to another.
Some of the important features of Sqoop:
 Sqoop helps us load the results of SQL queries into the Hadoop Distributed File
System.
 Sqoop helps us load the processed data directly into Hive or HBase.
 It secures data transfers with the help of Kerberos.
 With the help of Sqoop, we can perform compression of processed data.
 Sqoop is highly powerful and efficient in nature.

There are two major operations performed in Sqoop :
1. Import
2. Export

Sqoop Import:

The Sqoop import tool imports individual tables from an RDBMS to HDFS. Each row in
a table is treated as a record in HDFS. All records are stored as text data in text files or as
binary data in Avro and Sequence files.

Syntax

The following syntax is used to import data into HDFS.

$ sqoop import (generic-args) (import-args)


$ sqoop-import (generic-args) (import-args)
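
For instance (the connection string, credentials, table name, and target directory below
are placeholders only), importing a single table from MySQL into an HDFS directory
could look like:

$ sqoop import \
    --connect jdbc:mysql://dbserver/salesdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hadoop/customers \
    -m 4

Here -m 4 asks Sqoop to run four parallel map tasks for the import.
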

Sqoop Export:

The Sqoop export tool is used to export data back from HDFS to an RDBMS database.
The target table must already exist in the target database. The files which are given as
input to Sqoop contain records, which are called rows in the table. Those are read and
parsed into a set of records delimited with a user-specified delimiter.

The default operation is to insert all the records from the input files into the database
table using INSERT statements. In update mode, Sqoop generates UPDATE statements
that replace the existing records in the database.

Syntax

The following is the syntax for the export command.

$ sqoop export (generic-args) (export-args)


$ sqoop-export (generic-args) (export-args)

Kerberos is a network authentication protocol that Hadoop uses to create secure
communications between clients and components. It ensures that only authorized users
can access data in the Hadoop cluster and provides a way to audit who has accessed the
data and when.
