BDA.Unit-5
UNIT-V
HADOOP RELATED TOOLS
HBase – data model and implementations – HBase clients – HBase examples – praxis. Pig – Grunt – Pig
data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types and file formats –
HiveQL data definition – HiveQL data manipulation – HiveQL queries.
HBASE
Q. What is HBase? Draw the architecture of HBase. Explain the difference between HDFS and HBase.
Definition:
HBase is an open source, non-relational, distributed database modeled after Google's Bigtable. HBase is
an open-source, sorted-map datastore built on Hadoop. It is column-oriented and horizontally scalable.
It is part of the Hadoop ecosystem and provides random, real-time read/write access to data in the
Hadoop file system. It runs on top of Hadoop and HDFS, providing Bigtable-like capabilities for
Hadoop.
HBase supports massively parallelized processing via MapReduce, with HBase serving as both source
and sink.
HBase is a column-oriented distributed database in the Hadoop environment. It can store massive amounts
of data, from terabytes to petabytes, as scalable, distributed big data storage on top of the Hadoop
ecosystem.
HBase supports an easy-to-use Java API for programmatic access. It also supports Thrift and REST for
non-Java front ends.
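As a brief illustration of the Java client API, a minimal sketch is given below. It opens a connection, writes one cell and reads it back. The table name 'emp' and column family 'personal' are assumptions for illustration, and the code presumes an hbase-site.xml on the classpath pointing at a running cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("emp"))) {
            // Write one cell: row key "1", column family "personal", qualifier "name"
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Ravi"));
            table.put(put);
            // Random-access read of the same row
            Result result = table.get(new Get(Bytes.toBytes("1")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name)); // prints "Ravi"
        }
    }
}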
HBase architecture:
ZooKeeper is a centralized monitoring service that maintains configuration information and provides
distributed synchronization. If a client wants to communicate with region servers, it has to approach
ZooKeeper first.
HMaster is the master server of HBase and coordinates the HBase cluster. HMaster is responsible for
the administrative operations of the cluster.
HRegion servers: They perform the following functions in communication with HMaster and
ZooKeeper:
1. Hosting and managing regions.
2. Splitting regions automatically.
3. Handling read and write requests.
4. Communicating with clients directly.
HRegions: For each column family, an HRegion maintains a store. The main components of an HRegion
are the MemStore and the HFile.
The data model in HBase is designed to accommodate semi-structured data that can vary in field size,
data type and number of columns.
HBase is a column-oriented, non-relational database. This means that data is stored in individual
columns and indexed by a unique row key. This architecture allows for rapid retrieval of individual rows
and columns and efficient scans over individual columns within a table.
Both data and requests are distributed across all servers in an HBase cluster, allowing users to query
petabytes of data within milliseconds. HBase is most effectively used to store non-relational data,
accessed via the HBase API.
Difference between HDFS and HBase:
1. HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built
on top of HDFS.
2. HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for
large tables.
3. HDFS provides high-latency batch processing, whereas HBase provides low-latency access to single
rows from billions of records (random access).
4. HDFS provides only sequential access to data, whereas HBase internally uses hash tables, provides
random access, and stores its data in indexed HDFS files for faster lookups.
5. HDFS is suited for high-latency operations, whereas HBase is suited for low-latency operations.
6. In HDFS, data is primarily accessed through MapReduce jobs, whereas HBase data is accessed
through shell commands and client APIs in Java, REST, Avro or Thrift.
There is one region server per node. There are many regions in a region server. At any time, a given
region is pinned to a particular region server. Tables are split into regions and are scattered across region
servers. A table must have at least one region.
Rows: A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a
table and are always treated as a byte[].
Column families: Data in a row are grouped together as column families. Each column family has one
or more columns, and the columns in a family are stored together in a low-level storage file known
as an HFile. Column families form the basic unit of physical storage to which certain HBase features,
such as compression, are applied.
Columns: A column family is made of one or more columns. A column is identified by a column
qualifier, which consists of the column family name concatenated with the column name using a colon
(i.e., family:qualifier).
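For instance, a single hedged HBase shell line that writes one cell, addressing it by column family and qualifier (the table 'emp' and family 'personal' are illustrative):

hbase> put 'emp', '1', 'personal:city', 'Coimbatore'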
HIVE
Q. Draw and explain architecture of Hive.
User Interface: Hive is data warehouse infrastructure software that can create interaction between the
user and HDFS.
The user interfaces that Hive supports are Hive Web UI, Hive command line and Hive HD Insight.
Meta Store: Hive chooses respective database servers to store the schema or Metadata of tables,
databases, columns in a table, their data types and HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL and is used for querying schema information in the
Metastore. It is one of the replacements for the traditional approach of writing MapReduce programs:
instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and
process it.
Execution engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution
Engine. The execution engine processes the query and generates the same results as MapReduce, using
MapReduce internally.
HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to
store the data in the file system.
Working of Hive:
[Figure: Working of Hive]
1. Execute query: The Hive interface, such as the command line or Web UI, sends the query to the
driver to execute.
2. Get plan: The driver takes the help of the query compiler, which parses the query to check the
syntax and prepares the query plan.
3. Get metadata: The compiler sends a metadata request to the Metastore.
4. Send metadata: The Metastore sends the metadata as a response to the compiler.
5. Send plan: The compiler checks the requirement and resends the plan to the driver. Up to here,
the parsing and compiling of the query is complete.
6. Execute plan: The driver sends the execute plan to the execution engine.
7. Execute job: Internally, the execution of the job is a MapReduce job. The execution engine
sends the job to the JobTracker, which is in the NameNode, and it assigns this job to the TaskTracker,
which is in the DataNode. Here, the query executes as a MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata
operations with the Metastore.
8. Fetch result: The execution engine receives the results from the DataNodes.
9. Send results: The execution engine sends those resultant values to the driver.
10. Send results: The driver sends the results to the Hive interfaces.
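To make this flow concrete, consider a hedged example (the table employees and its columns are illustrative): submitting the query below from the Hive CLI makes the driver and compiler build a plan, and the execution engine then runs it as a MapReduce job whose results come back through the driver.

hive> SELECT dept, COUNT(*) FROM employees GROUP BY dept;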
DATA TYPES AND FILE FORMATS
1. Data types :
Hive data types can be classified into two categories: Primary data types and Complex data types.
Primary data types are of four types: Numeric, string, date/time and miscellaneous types
Numeric data types: Integral types are TINYINT, SMALLINT, INT and BIGINT. Floating types are
FLOAT, DOUBLE and DECIMAL.
String data types are STRING, VARCHAR and CHAR.
Date/Time data types: Hive provides DATE and TIMESTAMP data types, based on the traditional UNIX
timestamp, for date/time-related fields in Hive. DATE values are represented in the form YYYY-MM-
DD. TIMESTAMP uses the format yyyy-mm-dd hh:mm:ss[.f...].
Miscellaneous types: Hive supports two more primitive data types: BOOLEAN and BINARY. BOOLEAN
stores true or false values only, while BINARY stores a sequence of bytes.
Complex types are ARRAY, MAP, STRUCT and UNION.
ARRAY in Hive is an ordered sequence of elements of the same type that are indexable using
zero-based integers.
MAP in Hive is a collection of key-value pairs, where the fields are accessed using array
notation on keys (e.g., ['key']).
STRUCT in Hive is similar to the STRUCT in the C language. It is a record type that encapsulates a
set of named fields, which can be any primitive data type.
UNION type in Hive is similar to the UNION in C. A UNION at any point of time can hold
exactly one value from its specified data types.
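A short HiveQL sketch pulling these types together (the table and field names are illustrative only):

hive> CREATE TABLE employees (
        name STRING,
        salary FLOAT,
        subordinates ARRAY<STRING>,
        deductions MAP<STRING, FLOAT>,
        address STRUCT<street:STRING, city:STRING, zip:INT>
      );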
2. File formats:
Hive stores table data in pluggable file formats, such as plain text (TEXTFILE, the default),
SEQUENCEFILE, and columnar formats like ORC and PARQUET; the format is chosen with the
STORED AS clause of CREATE TABLE.
HIVEQL DATA DEFINITION
Hive database: In Hive, a database is considered a catalog or namespace of tables. It is also
common to use databases to organize production tables into logical groups. If we do not specify a
database, the default database is used.
Let's create a new database by using the following command:
hive> CREATE DATABASE Rollcall;
Make sure the database being created does not already exist in the Hive warehouse; if it does, Hive
throws a "Database Rollcall already exists" error.
At any time, we can see the databases that already exist as follows:
hive> SHOW DATABASES;
default
Rollcall
hive> CREATE DATABASE student;
hive> SHOW DATABASES;
default
Rollcall
student
Hive will create a directory for each database. Tables in that database will be stored in subdirectories of
the database directory. The exception is tables in the default database, which doesn't have its own
directory.
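To confirm where a database's directory lives, a hedged example (the location shown depends on the installation's hive.metastore.warehouse.dir setting; Hive lowercases database names on disk):

hive> DESCRIBE DATABASE Rollcall;
rollcall    hdfs://localhost:9000/user/hive/warehouse/rollcall.db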
Drop Database Statement:
Syntax: DROP (DATABASE | SCHEMA) [IF EXISTS] database_name [RESTRICT | CASCADE];
Example: hive> DROP DATABASE IF EXISTS userid;
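By default the drop is RESTRICT, so a database that still contains tables cannot be dropped. A hedged example that drops the Rollcall database together with any tables it contains:

hive> DROP DATABASE IF EXISTS Rollcall CASCADE;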
Q. Write commands to create the following table in HBase and write commands to perform the
following operations in HBase. APR/MAY 2024
Row Key | Name  | Age | City
1       | Ravi  | 36  | Coimbatore
2       | Udaya | 37  | Salem
3       | Rama  | 40  | Ooty
Creation of table emp
The syntax to create a table in the HBase shell is shown below:
create '<table name>', '<column family>'
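A sketch of the full set of commands for this table, assuming a single column family named 'personal' (the question does not specify a family name):

hbase(main):001:0> create 'emp', 'personal'
hbase(main):002:0> put 'emp', '1', 'personal:name', 'Ravi'
hbase(main):003:0> put 'emp', '1', 'personal:age', '36'
hbase(main):004:0> put 'emp', '1', 'personal:city', 'Coimbatore'
hbase(main):005:0> put 'emp', '2', 'personal:name', 'Udaya'
hbase(main):006:0> put 'emp', '2', 'personal:age', '37'
hbase(main):007:0> put 'emp', '2', 'personal:city', 'Salem'
hbase(main):008:0> put 'emp', '3', 'personal:name', 'Rama'
hbase(main):009:0> put 'emp', '3', 'personal:age', '40'
hbase(main):010:0> put 'emp', '3', 'personal:city', 'Ooty'
hbase(main):011:0> scan 'emp'       # view the whole table
hbase(main):012:0> get 'emp', '2'   # read a single row by key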
Q. Write a user defined function in Pig Latin which performs the following using the sample dataset
provided. APR/MAY 2024
i) Assume the provided dataset is an Excel sheet. Read the countries and customer data separately and
specify the resulting data structure.
ii) Out of all the countries available, find the Asian countries.
iii) Find the customers who belong to Asia.
iv) For those customers, find their customer names.
v) Sort the results in ascending order and save them into a file.
Step 1: Verifying the Hadoop installation
$ hadoop version
Step 2: Starting HDFS
$ cd $HADOOP_HOME/sbin/
$ start-dfs.sh
$ start-yarn.sh
Step 3: Creating a directory in HDFS
$ cd $HADOOP_HOME/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/Pig_Data
Step 4: Placing the data in HDFS
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/customer_data.xls hdfs://localhost:9000/Pig_Data/
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/country_data.xls hdfs://localhost:9000/Pig_Data/
Step 5: Verifying the files
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/Pig_Data/customer_data.xls
$ hdfs dfs -cat hdfs://localhost:9000/Pig_Data/country_data.xls
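Step 6: Loading the data into Pig. The LOAD statements are not shown in the notes; the sketch below assumes the Excel sheets were first exported to comma-separated text (PigStorage cannot parse binary .xls files) and assumes the field names that the later steps rely on (country_id, customer_name, country_region):

grunt> customer = LOAD 'hdfs://localhost:9000/Pig_Data/customer_data.xls' USING PigStorage(',')
           AS (customer_id:int, customer_name:chararray, country_id:int);
grunt> country = LOAD 'hdfs://localhost:9000/Pig_Data/country_data.xls' USING PigStorage(',')
           AS (country_id:int, country_name:chararray, country_region:chararray);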
Join
grunt> customer_country = JOIN customer BY country_id, country BY country_id;
Filter
grunt> filter_data = FILTER customer_country BY country_region == 'Asia';
grunt> dump filter_data;
Ascending
grunt> customer_names = FOREACH filter_data GENERATE customer_name;  -- step (iv): extract the names
grunt> order_by_data = ORDER customer_names BY customer_name ASC;
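Finally, step (v) saves the sorted names into a file (the output path is illustrative):

grunt> STORE order_by_data INTO 'hdfs://localhost:9000/Pig_Data/asian_customer_names' USING PigStorage(',');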
UNIT-V
QUESTION BANK
1. What is HBase? Draw the architecture of HBase. Explain the difference between HDFS and HBase.
2. Examine HBase's real-world uses and benefits as a scalable and versatile NoSQL database.
Nov/Dec-2023.
3. Explain in detail the data model and implementation of HBase.
4. Briefly explain HBase clients with examples.
5. Write a short note on praxis.
6. What is Pig? Explain the features of Pig. Draw the architecture of Pig.
7. Draw and explain the architecture of Hive.
8. Explain in detail the data types and file formats of Hive.
9. Narrate the salient points on data manipulation in Hive using HiveQL. Nov/Dec-2023.