
V.R.S.

College of Engineering and Technology


(Reaccredited by NAAC and an ISO 9001:2008 Recertified Institution)

SUBJECT NAME : BIG DATA ANALYTICS


SUBJECT CODE : CCS334
REGULATION : 2021
YEAR/SEMESTER : III/V
BRANCH : CSE

UNIT-V
HADOOP RELATED TOOLS
Hbase – data model and implementations – Hbase clients – Hbase examples – praxis. Pig – Grunt – Pig
data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types and file formats –
HiveQL data definition – HiveQL data manipulation – HiveQL queries.

HBASE
Q. What is Hbase? Draw architecture of Hbase. Explain difference between HDFS and Hbase.
Definition:
 HBase is an open source, non-relational, distributed database modeled after Google's BigTable. HBase is
an open-source, sorted map datastore built on Hadoop. It is column-oriented and horizontally scalable.
 It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop file system. It runs on top of Hadoop and HDFS, providing Big Table-like capabilities for
Hadoop.
 HBase supports massively parallelized processing via MapReduce for using HBase as both source and
sink.
 HBase is a column oriented distributed database in Hadoop environment. It can store massive amounts
of data from terabytes to petabytes. HBase is scalable, distributed big data storage on top of the Hadoop
eco system.
 HBase supports an easy-to-use Java API for programmatic access. It also supports Thrift and REST for
non-Java front-ends.



 The HBase physical architecture consists of servers in a Master-Slave relationship. Typically, the HBase
cluster has one Master node, called HMaster, and multiple Region Servers, each called an HRegionServer.

Hbase architecture
 Zookeeper is a centralized monitoring server which maintains configuration information and provides
distributed synchronization. If a client wants to communicate with region servers, the client has to
approach Zookeeper first.
 HMaster is the master server of HBase and it coordinates the HBase cluster. HMaster is responsible for
the administrative operations of the cluster.
 HRegion servers: They perform the following functions in communication with HMaster and
Zookeeper:
1. Hosting and managing regions.
2. Splitting regions automatically.
3. Handling read and write requests.
4. Communicating with clients directly.
 HRegions: For each column family, an HRegion maintains a store. The main components of an HRegion are
the MemStore and the HFile.
 Data model in HBase is designed to accommodate semi-structured data that could vary in field size, data
type and columns.
 HBase is a column-oriented, non-relational database. This means that data is stored in individual
columns and indexed by a unique row key. This architecture allows for rapid retrieval of individual rows
and columns and efficient scans over individual columns within a table.
 Both data and requests are distributed across all servers in an HBase cluster, allowing user to query
results on petabytes of data within milliseconds. HBase is most effectively used to store non-relational
data, accessed via the HBase API.



Features and Application of Hbase
Examine Hbase’s real world uses and benefits as a scalable and versatile NoSQL database. Nov/Dec-2023.
Features of Hbase:
1. Hbase is linearly scalable.
2. It has automatic failure support.
3. It provides consistent read and writes.
4. It integrates with Hadoop, both as a source and a destination.
5. It has an easy Java API for clients.
6. It provides data replication across clusters.
Where to use Hbase ?
1. Apache Hbase is used to have random, real-time read/write access to Big Data.
2. It hosts very large tables on top of clusters of commodity hardware.
3. Apache Hbase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon
Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
Applications of Hbase :
1. It is used whenever there is a need for write-heavy applications.
2. Hbase is used whenever we need to provide fast random access to available data.
3. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.
Difference between HDFS and Hbase
HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
It provides high latency batch processing. | It provides low latency access to single rows from billions of records (random access).
It provides only sequential access of data. | HBase internally uses hash tables and provides random access; it stores the data in indexed HDFS files for faster lookups.
HDFS is suited for high latency operations. | HBase is suited for low latency operations.
In HDFS, data is primarily accessed through MapReduce jobs. | HBase provides access to single rows from billions of records.
HDFS doesn't have the concept of random read and write operations. | HBase data is accessed through shell commands, the client API in Java, REST, Avro or Thrift.



Difference between Hbase and Relational Database
HBase | Relational Database
HBase is schema-less. | A relational database is based on a fixed schema.
It is a column-oriented datastore. | It is a row-oriented datastore.
It is designed to store denormalized data. | It is designed to store normalized data.
It contains wide and sparsely populated tables. | It contains thin tables.
HBase supports automatic partitioning. | A relational database has no built-in support for partitioning.
It is good for semi-structured as well as structured data. | It is good for structured data.
No transactions are there in HBase. | An RDBMS is transactional.
Limitations of HBase
 It takes a very long time to recover if the HMaster goes down. It takes a long time to activate another
node if the first nodes go down.
 In HBase, cross data operations and join operations are very difficult to perform.
 HBase needs a new format when we want to migrate from RDBMS external sources to HBase servers.
 It is very challenging in HBase to support a rich querying process.
 It takes considerable effort to build the security layer that grants access to users.
 HBase allows only one default sort for a table and it does not support large size of binary files.
 HBase is expensive in terms of hardware requirement and memory blocks' allocations.
DATA MODEL AND IMPLEMENTATIONS
Q. Explain in details about data model and implementation of Hbase.
 The Apache HBase Data Model is designed to accommodate structured or semi-structured data that
could vary in field size, data type and columns. HBase stores data in tables, which have rows and
columns. The table schema is very different from traditional relational database tables.
 A database consists of multiple tables. Each table consists of multiple rows, sorted by row key. Each row
contains a row key and one or more column families.
 Each column family is defined when the table is created. Column families can contain multiple columns.
(family:column). A cell is uniquely identified by (table, row, family:column). A cell contains an
uninterpreted array of bytes and a timestamp.
 HBase data model has some logical components which are as follows:
1. Tables 2. Rows
3. Column Families/Columns 4. Versions/Timestamp 5. Cells



 Tables: HBase tables are more like a logical collection of rows stored in separate partitions called
Regions. Every Region is served by exactly one Region Server.
 The syntax to create a table in HBase shell is shown below.
create '<table name>', '<column family>'
 Example: create 'CustomerContactInformation', 'CustomerName', 'ContactInfo'
 Tables are automatically partitioned horizontally by HBase into regions. Each region comprises a subset
of a table's rows. A region is denoted by the table it belongs to, its first row (inclusive) and its last row (exclusive).

Region with table

 There is one region server per node. There are many regions in a region server. At any time, a given
region is pinned to a particular region server. Tables are split into regions and are scattered across region
servers. A table must have at least one region.
 Rows: A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a
Table and are always treated as a byte[ ].
 Column families: Data in a row are grouped together as Column Families. Each Column Family has
one or more Columns, and these Columns in a family are stored together in a low-level storage file known
as an HFile. Column Families form the basic unit of physical storage to which certain HBase features like
compression are applied.
 Columns: A Column Family is made of one or more columns. A Column is identified by a Column
Qualifier that consists of the Column Family name concatenated with the Column name using a colon



for example, columnfamily:columnname. There can be multiple Columns within a Column Family and
Rows within a table can have a varied number of Columns.
 Cell: A Cell stores data and is essentially a unique combination of rowkey, Column Family and the
Column (Column Qualifier). The data stored in a Cell is called its value and the data type is always
treated as byte[ ].
 Version: The data stored in a cell is versioned and versions of data are identified by the timestamp. The
number of versions of data retained in a column family is configurable and this value by default is 3.
 Time-to-Live: TTL is a built-in feature of HBase that ages out data based on its timestamp. This idea
comes in handy in use cases where data needs to be held only for certain duration of time. So, if on a
major compaction the timestamp is older than the specified TTL in the past, the record in question
doesn't get put in the HFile being generated by the major compaction; that is, the older records are
removed as a part of the normal upkeep of the table.
 If TTL is not used and an aging requirement is still needed, then a much more I/O intensive operation
would need to be done.
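 For illustration, TTL (and the number of retained cell versions) can be set per column family when a table is created in the HBase shell. This is a sketch only; the table name 'events' and family name 'log' are made up for the example:
create 'events', {NAME => 'log', VERSIONS => 3, TTL => 86400}   # TTL is in seconds (one day here)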
HBASE CLIENTS
Q. Briefly explain about Hbase clients with examples.
There are a number of client options for interacting with an HBase cluster.
1. Java
 Hbase is written in Java.
 Example: Creating table and inserting data in Hbase table are shown in the following program.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();

    // Create a table named "test" with a single column family named "data"
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte[] tablename = htd.getName();

    // Run some operations -- a put
    HTable table = new HTable(config, tablename);
    byte[] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte[] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("FN"), Bytes.toBytes("value1"));
    table.put(p1);
  }
}
 To create a table, we first create an instance of HBaseAdmin and then ask it to create the table
named test with a single column family named data.
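 As a small companion sketch (not part of the original listing), the value just written could be read back with the same pre-1.0 client API, reusing the config object from the example above; the Get and Result classes come from org.apache.hadoop.hbase.client:
// Read back the cell written above ("row1", family "data", qualifier "FN")
HTable table = new HTable(config, "test");
Get g = new Get(Bytes.toBytes("row1"));
Result result = table.get(g);
byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("FN"));
System.out.println(Bytes.toString(value));   // prints "value1"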
2. MapReduce
 HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as
a source and/or sink in MapReduce jobs. The TableInputFormat class makes splits on region boundaries
so maps are handed a single region to work on. The TableOutputFormat will write the result of
MapReduce into HBase.
 Example: A MapReduce application to count the number of rows in an HBase table
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class RowCounter {
  /** Name of this job. */
  static final String NAME = "rowcounter";

  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    /** Counter enumeration to count the actual rows. */
    public static enum Counters { ROWS }

    @Override
    public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException {
      // Increment the counter once per row that has at least one non-empty cell
      for (KeyValue value : values.list()) {
        if (value.getValue().length > 0) {
          context.getCounter(Counters.ROWS).increment(1);
          break;
        }
      }
    }
  }

  public static Job createSubmittableJob(Configuration conf, String[] args)
      throws IOException {
    String tableName = args[0];
    Job job = new Job(conf, NAME + "_" + tableName);
    job.setJarByClass(RowCounter.class);
    // Columns are space delimited
    StringBuilder sb = new StringBuilder();
    final int columnoffset = 1;
    for (int i = columnoffset; i < args.length; i++) {
      if (i > columnoffset) {
        sb.append(" ");
      }
      sb.append(args[i]);
    }
    // Scan only the first key-value of every row
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter());
    if (sb.length() > 0) {
      for (String columnName : sb.toString().split(" ")) {
        String[] fields = columnName.split(":");
        if (fields.length == 1) {
          scan.addFamily(Bytes.toBytes(fields[0]));
        } else {
          scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
        }
      }
    }
    job.setOutputFormatClass(NullOutputFormat.class);
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 1) {
      System.err.println("ERROR: Wrong number of parameters: " + args.length);
      System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
      System.exit(-1);
    }
    Job job = createSubmittableJob(conf, otherArgs);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
3. Avro, REST, and Thrift
 HBase ships with Avro, REST and Thrift interfaces. These are useful when the interacting application is
written in a language other than Java. In all cases, a Java server hosts an instance of the HBase client,
brokering Avro, REST and Thrift application requests in and out of the HBase cluster. This extra work of
proxying requests and responses means these interfaces are slower than using the Java client directly.
 REST: To put up a Stargate (HBase REST gateway) instance, start it using the following command:
% hbase-daemon.sh start rest
 This will start a server instance, by default on port 8080, background it and catch any emissions by the
server in log files under the HBase logs directory.
 Clients can ask for the response to be formatted as JSON, Google's protobufs, or as XML, depending on
how the client HTTP Accept header is set.
 To stop the REST server, type:
% hbase-daemon.sh stop rest
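 As an illustration (assuming the gateway is running on its default port 8080 and the test table with column family data from the Java example above exists), a non-Java client could fetch a cell over HTTP; the row key and qualifier here are just examples:
% curl -H "Accept: application/json" http://localhost:8080/test/row1/data:FN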
 Thrift: Start a Thrift service by putting up a server to field Thrift clients by running the following:
% hbase-daemon.sh start thrift
 This will start the server instance, by default on port 9090, background it and catch any emissions by the
server in log files under the HBase logs directory. The HBase Thrift documentation notes the Thrift
version used to generate the classes.
 To stop the Thrift server, type:
% hbase-daemon.sh stop thrift
 Avro: The Avro server is started and stopped in the same manner as we start and stop the Thrift or REST
services. The Avro server by default uses port 9090.
Praxis
Q. Write short note on Praxis.



 When an HBase cluster is running under load, the following issues are considered:
1. Versions: A particular HBase version runs only on a Hadoop release with a matching minor version.
For example, HBase 0.20.5 would run on Hadoop 0.20.2, but HBase 0.19.5 would not run on Hadoop 0.20.0.
2. HDFS: In MapReduce, HDFS files are opened, with their content streamed through a map task and
then closed. In HBase, data files are opened on cluster startup and kept open. Because of this, HBase
tends to see issues not normally encountered by MapReduce clients.
 Running out of file descriptors: Because of the files kept open on a loaded cluster, it doesn't take long before we
run into system- and Hadoop-imposed limits. Each open file consumes at least one descriptor on
the remote datanode. The default limit on the number of file descriptors per process is 1,024. We can check
that the HBase process is running with sufficient file descriptors by looking at the first few lines of a
region server's log.
 Running out of datanode threads: The Hadoop datanode has an upper bound of 256 on the number of
threads it can run at any one time.
 Sync: We must run HBase on an HDFS that has a working sync. Otherwise, there is loss of data. This
means running HBase on Hadoop 0.21.x, which adds a working sync/append to Hadoop 0.20.
 UI: HBase runs a web server on the master to present a view of the state of the running cluster. By default,
it listens on port 60010. The master UI displays a list of basic attributes such as software versions,
cluster load, request rates, lists of cluster tables and participating regionservers.
 Schema Design: HBase tables are like those in an RDBMS, except that cells are versioned, rows are
sorted and columns can be added on the fly by the client as long as the column family they belong to
preexists.
 Joins: There is no native database join facility in HBase, but wide tables can remove the need for
database joins that pull from secondary or tertiary tables. A wide row can sometimes be made to
hold all data that pertains to a particular primary key.
Pig
Q. What is Pig? Explain feature of Pig. Draw architecture of pig.
 Pig is an open-source high level data flow system. A high-level platform for creating MapReduce
programs used in Hadoop. It translates into efficient sequences of one or more MapReduce jobs.
 Pig offers a high-level language to write data analysis programs which we call as Pig Latin. The salient
property of pig programs is that their structure is amenable to substantial parallelization, which in turns
enables them to handle very large data sets.
 Pig makes use of both, the Hadoop Distributed File System as well as the MapReduce.
Features of Pig Hadoop:



1. In-built operators: Apache Pig provides a very good set of operators for performing several data
operations like sort, join, filter, etc.
2. Ease of programming.
3. Automatic optimization: The tasks in Apache Pig are automatically optimized.
4. Handles all kinds of data: Apache Pig can analyze both structured and unstructured data and
store the results in HDFS.
 Pig has two execution modes:
1. Local mode: To run Pig in local mode, we need access to a single machine; all files are installed and
run using the local host and file system. Specify local mode using the -x flag (pig -x local).
2. MapReduce mode: To run Pig in MapReduce mode, we need access to a Hadoop cluster and an HDFS
installation. MapReduce mode is the default mode, so we do not have to specify it, but we can do so
explicitly with the -x flag (pig -x mapreduce).
 Pig Hadoop framework has four main components:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the parser. The parser is
responsible for checking the syntax of the script, along with other miscellaneous checks. Parser gives
an output in the form of a Directed Acyclic Graph (DAG) that contains Pig Latin statements,
together with other logical operators represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is passed to a
logical optimizer. The optimizer is responsible for carrying out the logical optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is received. The
compiler compiles the optimized logical plan, which is then converted into a series of MapReduce
tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs are sent to
Hadoop in a properly sorted order and these jobs are executed on Hadoop for yielding the desired
result.



Pig architecture
 Pig can run in two types of environments: the local environment in a single JVM or the distributed
environment on a Hadoop cluster.
 Pig has variety of scalar data types and standard data processing options. Pig supports Map data; a map
being a set of key - value pairs.
 Most pig operators take a relation as an input and give a relation as the output. It allows normal
arithmetic operations and relational operations too.
 Pig's language layer currently consists of a textual language called Pig Latin. Pig Latin is a data flow
language. This means it allows users to describe how data from one or more inputs should be read,
processed and then stored to one or more outputs in parallel.
 These data flows can be simple linear flows, or complex workflows that include points where multiple
inputs are joined and where data is split into multiple streams to be processed by different operators. To
be mathematically precise, a Pig Latin script describes a directed acyclic graph (DAG), where the edges
are data flows and the nodes are operators that process the data.
 The first step in a Pig program is to LOAD the data we want to manipulate from HDFS. Then we run
the data through a set of transformations. Finally, we DUMP the data to the screen or STORE the results in
a file somewhere, as sketched below.
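 A minimal sketch of that flow, assuming a tab-delimited file logs.txt already sits in HDFS (the file and field names are only illustrative):
-- load, transform, store: the basic shape of a Pig data flow
records = LOAD 'logs.txt' AS (userid:chararray, bytes:int);
big_records = FILTER records BY bytes > 1024;
DUMP big_records;                       -- show the result on the screen
STORE big_records INTO 'big_records';   -- or store it back into HDFS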
Advantages of Pig:
1. Fast execution that works with MapReduce, Spark and Tez.



2. Its ability to process almost any amount of data, regardless of size.
3. A strong documentation process that helps new users learn Pig Latin.
4. Local and remote interoperability that lets professionals work from anywhere with a reliable
connection.
Pig disadvantages:
1. Slow start-up and clean-up of MapReduce jobs
2. Not suitable for interactive OLAP analytics
3. Complex applications may require many user-defined functions.
PIG DATA MODEL
 With Pig, the data model is specified when the data is loaded. Any data that we load from the disk into
Pig will have a specific schema and structure. The Pig data model is rich enough to manage most of what's
thrown its way, like table-like structures and nested hierarchical data structures.
 However, Pig data types can be divided into two groups in general terms: scalar types and complex
types.
 Scalar types contain a single value, while complex types contain other values, such as those of
Tuple, Bag and Map.
 In its data model, Pig Latin has these four types (illustrated after this list):
 Atom: An atom is any single value, for example a string such as 'Hadoop' or a number. The
atomic values of Pig are the scalar types that appear in most programming languages:
int, long, float, double, chararray and bytearray.
 Tuple: A tuple is a record formed by an ordered set of fields. Each field can be of any
type, for example 'Hadoop' or 6. Think of a tuple as a row in a table.
 Bag: A bag is an unordered collection of tuples. The bag's schema is flexible; each tuple in
the collection can contain an arbitrary number of fields, and each field can be of any type.
 Map: A map is a set of key-value pairs. The value can store any type and the key needs to
be unique. The key of a map must be a chararray and the value may be of any type.
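 As a sketch of how constants of these types are written in Pig Latin (the values below are made up for illustration):
'Hadoop'                               -- atom (a chararray constant)
(101, 'Hadoop', 6.5)                   -- tuple: an ordered set of fields
{(101, 'Hadoop'), (102, 'Pig')}        -- bag: an unordered collection of tuples
['name'#'Hadoop', 'version'#'0.17']    -- map: unique chararray keys mapped to values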
PIG LATIN
 Pig Latin is the data flow language used by Apache Pig to analyze the data in Hadoop. It is a textual
language that abstracts the programming from the Java MapReduce idiom into a notation.
 Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as an input
and generates another relation as an output.
a. It can span multiple lines.
b. Each statement must end with a semi-colon.



c. It may include expression and schemas.
d. By default, these statements are processed using multi-query execution.
 Pig Latin statements work with relations. A relation can be defined as follows:
a. A relation is a bag (more specifically, an outer bag).
b. A bag is a collection of tuples.
c. A tuple is an ordered set of fields.
d. A field is a piece of data.
 Pig Latin Datatypes
1. Int: "int" represents a signed 32-bit integer. For example: 13
2. Long: "long" represents a signed 64-bit integer. For example: 13L
3. Float: This data type represents a signed 32-bit floating point. For example: 130.5F
4. Double: "double" represents a 64-bit floating point. For example: 13.5
5. Chararray: It represents a character array (string) in Unicode UTF-8 format. For example: 'Big
Data'
6. Bytearray: This data type represents a byte array.
7. Boolean: "boolean" represents a Boolean value. For example: true/false.
DEVELOPING AND TESTING PIG LATIN SCRIPTS
 Pig provides several tools and diagnostic operators to help us to develop applications.
 Scripts in Pig can be executed in interactive or batch mode. To use pig in interactive mode, we invoke it
in local or map-reduce mode then enter commands one after the other. In batch mode, we save
commands in a pig file and specify the path to the file when invoking pig.
 At an overly simplified level a Pig script consists of three steps. In the first step we load data from
HDFS. In the second step we perform transformations on the data. In the final step we store transformed
data. Transformations are the heart of Pig scripts.
 Pig has a schema concept that is used when loading data to specify what Pig should expect. We first specify
the columns and optionally their data types. Any columns present in the data but not included in the schema are
truncated.
 When the data has fewer columns than those specified in the schema, the missing ones are filled with nulls. To load
the sample data sets we first move them to HDFS and then load them into Pig from there.
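 For instance (a sketch, assuming a comma-delimited students.csv with three columns has already been copied to HDFS; the path and field names are illustrative), the schema given to LOAD controls what Pig expects:
students = LOAD 'hdfs://localhost:9000/pig_data/students.csv' USING PigStorage(',')
           AS (id:int, name:chararray, marks:float);
DESCRIBE students;
-- students: {id: int, name: chararray, marks: float}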
 Pig Script Interfaces: Pig programs can be packaged in three different ways.
1. Script: This is nothing more than a file consisting of Pig Latin commands, identified by
the .pig suffix. Ending a Pig program with the .pig extension is a convention but not required. The



command is interpreted by the Pig Latin compiler and then runs in the order determined by the Pig
optimizer.
2. Grunt: Grunt acts as a command interpreter where we can interactively enter Pig Latin at the Grunt
command line and immediately see the response. This method is useful for prototyping during early
development stage and with what-if scenarios.
3. Embedded: Pig Latin statements can run within Java, JavaScript and Python programs.
 Pig scripts, Grunt shell Pig commands and embedded Pig programs can be executed in either Local
mode or on MapReduce mode. The Grunt shell enables an interactive shell to submit Pig commands and
run Pig scripts. To start the Grunt shell in Interactive mode, we need to submit the command pig at the
shell.
 To tell the compiler whether a script or Grunt shell is executed locally or in Hadoop mode, just specify it
in the -x flag to the pig command. The following is an example of how we would specify running our
Pig script in local mode:
pig -x local mindstick.pig
 Here's how we would run the Pig script in Hadoop mode, which is the default if we don't specify the
flag:
pig -x mapreduce mindstick.pig
 By default, when we specify the pig command without any parameters, it starts the Grunt shell in
Hadoop mode. If we want to start the Grunt shell in local mode just add the -x local flag to the
command.
HIVE
Q. Draw and explain architecture of Hive.
 Apache Hive is open-source data warehouse software for reading, writing and managing large data set
files that are stored directly in either the Apache Hadoop Distributed File System (HDFS) or other data
storage systems such as Apache HBase.
 Data analysts often use Hive to analyze data, query large amounts of unstructured data and generate
data summaries.
 Features of Hive :
1. It stores schema in a database and processes data into HDFS.
2. It is designed for OLAP.
3. It provides SQL type language for querying called HiveQL or HQL.
4. It is familiar, fast, scalable and extensible.



 Hive supports a variety of storage formats: TEXTFILE for plain text, SEQUENCEFILE for binary key-
value pairs, and RCFILE, which stores the columns of a table in a record columnar format.
 Hive table structure consists of rows and columns. The rows typically correspond to some record,
transaction, or particular entity detail.
 The values of the corresponding columns represent the various attributes or characteristics for each row.
 Hadoop and its ecosystem are used to apply some structure to unstructured data. Therefore, if a table
structure is an appropriate way to view the restructured data, Hive may be a good tool to use.
 Following are some Hive use cases:
1. Exploratory or ad-hoc analysis of HDFS data: Data can be queried, transformed and exported to
analytical tools.
2. Extracts or data feeds to reporting systems, dashboards, or data repositories such as HBase.
3. Combining external structured data to data already residing in HDFS.
Advantages :
1. Simple querying for anyone already familiar with SQL.
2. Its ability to connect with a variety of relational databases, including Postgres and MySQL.
3. Simplifies working with large amounts of data.
Disadvantages :
1. Updating data is complicated
2. No real time access to data
3. High latency.
 Program Example: Write a code in JAVA for a simple Word Count application that counts the number
of occurrences of each word in a given input set using the Hadoop Map-Reduce framework on local-
standalone set-up.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Hive Architecture

 User Interface: Hive is data warehouse infrastructure software that can create interaction between the user
and HDFS.
 The user interfaces that Hive supports are Hive Web UI, Hive command line and Hive HD Insight.
 Meta Store: Hive chooses respective database servers to store the schema or Metadata of tables,
databases, columns in a table, their data types and HDFS mapping.
 HiveQL Process Engine: HiveQL is similar to SQL for querying on schema info on the Metastore. It is
one of the replacements of traditional approach for MapReduce program. Instead of writing MapReduce
program in Java, we can write a query for MapReduce job and process it.
 Execution engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution
engine. The execution engine processes the query and generates results the same as MapReduce results. It
uses the flavor of MapReduce.
 HDFS or HBASE: Hadoop distributed file system or HBASE are the data storage techniques to store
data into file system.

Working of Hive :
Hive working

1. Execute query: The Hive interface such as command line or Web UI sends query to driver to
execute.
2. Get plan: The driver takes the help of query compiler that parses the query to check the syntax
and query plan or the requirement of query.
3. Get metadata: The compiler sends metadata request to metastore.
4. Send metadata: Metastore sends metadata as a response to the compiler.
5. Send plan: The compiler checks the requirement and resends the plan to the driver. Up to here,
the parsing and compiling of a query is complete.
6. Execute plan: The driver sends the execute plan to the execution engine.
7. Execute job: Internally, the process of execution job is a MapReduce job. The execution engine
sends the job to JobTracker, which is in Name node and it assigns this job to TaskTracker, which
is in Data node. Here, the query executes MapReduce job.
7.1 Metadata Ops: Meanwhile in execution, the execution engine can execute metadata
operations with Metastore.
8. Fetch result: The execution engine receives the results from data nodes.
9. Send results: The execution engine sends those resultant values to the driver.
10. Send results: The driver sends the results to Hive Interfaces.
DATA TYPES AND FILE FORMATS



Q. Explain in details about data types and file formats of Hive.

1. Data types :
 Hive data types can be classified into two categories: Primary data types and Complex data types.
 Primary data types are of four types: Numeric, string, date/time and miscellaneous types
 Numeric data types: Integral types are TINYINT, SMALLINT, INT and BIGINT. Floating types are
FLOAT, DOUBLE and DECIMAL.
 String data types are string, varchar and char.
 Date/Time data types: Hive provides DATE and TIMESTAMP data types in the traditional UNIX time
stamp format for date/time-related fields in Hive. DATE values are represented in the form YYYY-MM-
DD. TIMESTAMP uses the format yyyy-mm-dd hh:mm:ss[.f...].
 Miscellaneous types: Hive supports two more primitive data types: BOOLEAN and BINARY. BOOLEAN
stores true or false values only.
 Complex types are Array, Map, Struct and Union (a sample table definition using them is sketched after this list).
 Array in Hive is an ordered sequence of similar type elements that are indexable using the zero-
based integers.
 Map in Hive is a collection of key-value pairs, where the fields are accessed using array
notations of keys (e.g., ['key']).
 STRUCT in Hive is similar to the STRUCT in C language. It is a record type that encapsulates a
set of named fields, which can be any primitive data type.
 UNION type in Hive is similar to the UNION in C. UNION types at any point of time can hold
exactly one data type from its specified data types.
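 A sketch of a table definition that uses these complex types (the table and column names are only examples, not from the original notes):
CREATE TABLE employee (
  name       STRING,
  salary     FLOAT,
  skills     ARRAY<STRING>,                      -- ordered, zero-indexed elements
  deductions MAP<STRING, FLOAT>,                 -- accessed as deductions['tax']
  address    STRUCT<city:STRING, pincode:INT>    -- accessed as address.city
);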

2. File formats :



 In Hive, a file format refers to how records are stored inside the file. As we are dealing with structured data, each
record has to have its own structure. How records are encoded in a file defines the file format. These file
formats mainly vary in data encoding, compression rate, usage of space and disk I/O.
 Hive supports the file formats TEXTFILE, SEQUENCEFILE, RCFILE and ORCFILE.
 TEXTFILE format is a famous input/output format used in Hadoop. In Hive, if we define a table as
TEXTFILE it can load data from CSV (Comma Separated Values) files, data delimited by tabs or spaces, and
JSON data.
 Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to
MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence
files are in a binary format which can be split, and their main use is to club two or more smaller
files into one sequence file. In Hive we can create a sequence file by specifying
STORED AS SEQUENCEFILE at the end of a CREATE TABLE statement.
 RCFILE stands for Record Columnar File, which is another type of binary file format offering a high
compression rate on top of the rows. RCFILE is used when we want to perform operations on
multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs.
 Facebook uses RCFILE as its default file format for storing data in their data warehouse, as they
perform different types of analytics using Hive.
 ORCFILE: ORC stands for Optimized Row Columnar, which means it can store data in a more optimized
way than the other file formats. ORC reduces the size of the original data by up to 75%. An ORC file
contains row data in groups called stripes along with a file footer. The ORC format improves
performance when Hive is processing the data.
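 For illustration (a sketch; the table names are arbitrary), the file format is chosen with the STORED AS clause at table creation time:
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;
CREATE TABLE logs_orc  (line STRING) STORED AS ORC;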
HIVEQL DATA DEFINITION
Q. Write short note on HiveQL data definition, manipulation and queries.
Narrate the salient points on data manipulation in Hive using HiveQL. Nov/Dec-2023.
 HiveQL is the Hive query language. Hive offers no support for row-level inserts, updates and deletes.
Hive doesn't support transactions. DDL statements are used to define or change Hive databases and
database objects.
 Types of Hive DDL commands are: CREATE, SHOW, DESCRIBE, USE, DROP, ALTER and
TRUNCATE.

Hive DDL commands



DDL command    Used with
CREATE         Database, table
SHOW           Databases, tables, table properties, partitions, functions, index
DESCRIBE       Database, table, view
USE            Database
DROP           Database, table
ALTER          Database, table
TRUNCATE       Table

 Hive database: In Hive, the database is considered as a catalog or namespace of tables. It is also
common to use databases to organize production tables into logical groups. If we do not specify a
database, the default database is used.
 Let's create a new database by using the following command:
hive> CREATE DATABASE Rollcall;
 Make sure the database we are creating doesn't already exist in the Hive warehouse; if it exists, Hive throws a
"Database Rollcall already exists" error.
 At any time, we can see the databases that already exist as follows:
hive> SHOW DATABASES;
default
Rollcall
hive> CREATE DATABASE student;
hive> SHOW DATABASES;
default
Rollcall
student
 Hive will create a directory for each database. Tables in that database will be stored in subdirectories of
the database directory. The exception is tables in the default database, which doesn't have its own
directory.
 Drop Database Statement:
Syntax: DROP (DATABASE | SCHEMA) [IF EXISTS]
database_name [RESTRICT | CASCADE];
Example: hive> DROP DATABASE IF EXISTS userid;



 ALTER DATABASE: The ALTER DATABASE statement in Hive is used to change the metadata
associated with the database in Hive. Syntax for changing Database Properties:
ALTER (DATABASE | SCHEMA) database_name SET DBPROPERTIES
(property_name=property_value, ...);
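 The remaining DDL commands follow the same pattern. A short sketch, continuing the Rollcall database above (the table name and property value are only examples):
hive> CREATE TABLE student_marks (rollno INT, name STRING, marks FLOAT);
hive> DESCRIBE student_marks;
hive> ALTER DATABASE Rollcall SET DBPROPERTIES ('edited-by' = 'CSE');
hive> TRUNCATE TABLE student_marks;
hive> DROP TABLE student_marks;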
HIVEQL DATA MANIPULATION
 Data manipulation language is the subset of SQL statements that modify the data stored in tables. Hive has
no row-level insert, update and delete operations; the only way to put data into a table is to use one
of the "bulk" load operations.
Inserting data into tables from queries:
 The INSERT statement performs loading data into a table from a query.
INSERT OVERWRITE TABLE students
PARTITION (branch = 'CSE', classe = 'OR')
SELECT * FROM college_students se
WHERE se.bra = 'CSE' AND se.cla = 'OR';
 With OVERWRITE, any previous contents of the partition are replaced. If we drop the keyword
OVERWRITE or replace it with INTO, Hive appends the data rather than replaces it. This feature is only
available in Hive v0.8.0 or later.
 We can mix INSERT OVERWRITE clauses and INSERT INTO clauses, as well.
Dynamic partition inserts :
 Hive also supports a dynamic partition feature, where it can infer the partitions to create based on query
parameters. Hive determines the values of the partition keys, from the last two columns in the SELECT
clause.
 The static partition keys must come before the dynamic partition keys. Dynamic partitioning is not
enabled by default. When it is enabled, it works in "strict" mode by default, where it expects at least
some columns to be static. This helps protect against a badly designed query that generates a gigantic
number of partitions.
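 A sketch of a dynamic partition insert, continuing the students example above. The two SET properties are the standard Hive switches for this feature; the data column names (name, rollno) and the source columns (bra, cla) are assumptions for illustration, and the last SELECT column supplies the dynamic partition key:
hive> SET hive.exec.dynamic.partition = true;
hive> INSERT OVERWRITE TABLE students
      PARTITION (branch = 'CSE', classe)   -- static key first, dynamic key last
      SELECT se.name, se.rollno, se.cla
      FROM college_students se
      WHERE se.bra = 'CSE';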
 Hive Data Manipulation Language (DML) Commands
a. LOAD: The LOAD statement transfers data files into the locations that correspond to Hive tables.
b. SELECT: The SELECT statement in Hive functions similarly to the SELECT statement in SQL.
It is primarily for retrieving data from the database.
c. INSERT: The INSERT clause loads the data into a Hive table. Users can also perform an insert to
both the Hive table and/or partition.



d. DELETE: The DELETE clause deletes all the data in the table. Specific data can be targeted and
deleted if the WHERE clause is specified.
e. UPDATE: The UPDATE command in Hive updates the data in the table. If the query includes
the WHERE clause, then it updates the columns of the rows that meet the condition in the
WHERE clause.
f. EXPORT: The Hive EXPORT command moves the table or partition data together with the
metadata to a designated output location in HDFS.
g. IMPORT: The Hive IMPORT statement imports the data from a particular location into a new
or currently existing table.
HIVEQL QUERIES
 The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data
in a Metastore. Hive Query Language is used for processing and analyzing structured data. It separates
users from the complexity of Map Reduce programming.
SELECT... FROM Clauses
 SELECT is the projection operator in SQL. The FROM clause identifies from which table, view or
nested query we select records. For a given record, SELECT specifies the columns to keep, as well as
the outputs of function calls on one or more columns.
 Here's the syntax of Hive's SELECT statement.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number]
;
 SELECT is the projection operator in HiveQL. The points are:
a. SELECT scans the table specified by the FROM clause
b. WHERE gives the condition of what to filter
c. GROUP BY gives a list of columns which specify how to aggregate the records
d. CLUSTER BY, DISTRIBUTE BY, SORT BY specify the sort order and algorithm
e. LIMIT specifies how many records to retrieve.



Computing with Columns
 When we select the columns, we can manipulate column values using either arithmetic operators or
function calls. Math, date and string functions are also popular.
 Here's an example query that uses both operators and functions.
SELECT upper(name), sales_cost FROM products;
 WHERE Clauses: A WHERE clause is used to filter the result set by using predicate operators and
logical operators. Functions can also be used to compute the condition.
 GROUP BY Clauses: A GROUP BY clause is frequently used with aggregate functions, to group the
result set by columns and apply aggregate functions over each group. Functions can also be used to
compute the grouping key.
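 Putting the clauses together, a small sketch over the products table used above (the category column and the literal values are assumed for illustration):
hive> SELECT category, count(*) AS cnt, avg(sales_cost) AS avg_cost
      FROM products
      WHERE sales_cost > 100
      GROUP BY category
      HAVING count(*) > 5
      LIMIT 10;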

Write commands to create the following table in HBase and write commands to perform the following operations in
HBase. APR/MAY 2024

Row Key | Name  | Age | City
1       | Ravi  | 36  | Coimbatore
2       | Udaya | 37  | Salem
3       | Rama  | 40  | OOty
Creation of table emp
The syntax to create a table in HBase shell is shown below.
create '<table name>', '<column family>'

hbase(main):002:0> create 'emp','Data'


And it will give you the following output.
0 row(s) in 1.1300 seconds
=> Hbase::Table - emp
Verification
You can verify whether the table is created using the list command as shown below. Here you can observe the
created emp table.
hbase(main):002:0> list
TABLE
emp
2 row(s) in 0.0340 seconds

Enter the data into the table.


put '<table name>', '<row key>', '<colfamily:colname>', '<value>'
hbase(main):005:0> put 'emp','1','Data:Name','Ravi'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','1','Data:Age','36'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','1','Data:City','Coimbatore'
0 row(s) in 0.0240 seconds
hbase(main):007:0> put 'emp','2','Data:Name','udaya'



0 row(s) in 0.0240 seconds
hbase(main):006:0> put 'emp','2','Data:Age','37'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','2','Data:City','Salem'
0 row(s) in 0.0410 seconds
hbase(main):005:0> put 'emp','3','Data:Name','Rama'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','3','Data:Age','40'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','3','Data:City','OOty'

(i)Update Age in row 2 to 20


hbase(main):006:0> put 'emp','2','Data:Age','20'
0 row(s) in 0.0410 seconds
(ii) Show all the rows with the value of Age above 35
The comparison is done with a filter in a scan (the ages are stored as strings, so the binary comparison works here because all ages have two digits):
scan 'emp', {FILTER => "SingleColumnValueFilter('Data', 'Age', >, 'binary:35')"}
ROW          COLUMN+CELL
 1           column=Data:Age, timestamp=1418035791553, value=36
 1           column=Data:City, timestamp=1418035791555, value=Coimbatore
 1           column=Data:Name, timestamp=1418035791585, value=Ravi
 3           column=Data:Age, timestamp=1418035791557, value=40
 3           column=Data:City, timestamp=1418035791525, value=OOty
 3           column=Data:Name, timestamp=1418035791556, value=Rama
2 row(s) in 0.0080 seconds
(iii)Add gender information for all the rows in the table
hbase(main):005:0> put 'emp','1','Data:Gender','M'
0 row(s) in 0.6600 seconds
hbase(main):006:0> put 'emp','2','Data:Gender','M'
0 row(s) in 0.0410 seconds
hbase(main):007:0> put 'emp','3','Data:Gender','F'
0 row(s) in 0.0240 seconds
(iv) Count the number of entries in the table and print the count.
hbase(main):023:0> count 'emp'
3 row(s) in 0.0900 seconds
=> 3

(v) Commands to drop the table.

disable 'emp'
drop 'emp'

Write a user defined function in Pig Latin which performs the following using the sample dataset
provided. APR/MAY 2024
i) Assume the provided dataset is an excel sheet. Read the countries and customer data separately and
specify the resulting data structure
ii) Out of all the countries available find the asian countries.
iii) Find customers who belong to asia.
iv) For those customers find their customer names.
v) Sort the results in ascending order and save them into a file



Customer_id | Customer_Name | Gender | City      | Country_id
101         | Ajay          | M      | Kabul     | 1
102         | Baddri        | M      | New Delhi | 3
103         | Carolyn       | F      | Nairobi   | 4
104         | Daniel        | M      | Cape Town | 5
105         | Edwin         | M      | Ottawa    | 6
106         | Fathima       | F      | Chicago   | 7
107         | Ganga         | F      | Islamabad | 2

Country_id | Country_Name  | Country_Region
1          | Afghanistan   | Asia
5          | South Africa  | Africa
2          | Pakistan      | Asia
4          | Kenya         | Africa
3          | India         | Asia
6          | Canada        | North America
7          | United States | North America

Step 1: Verifying Hadoop

$ hadoop version
Step 2: Starting HDFS
cd /$Hadoop_Home/sbin/
$ start-dfs.sh
$ start-yarn.sh
Step 3: Creating a directory in HDFS
$ cd /$Hadoop_Home/bin/
$ hdfs dfs -mkdir hdfs://localhost:9000/pig_data
Step 4: Placing the data in HDFS
$ cd $HADOOP_HOME/bin
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/customer_data.xls hdfs://localhost:9000/pig_data/
$ hdfs dfs -put /home/Hadoop/Pig/Pig_Data/country_data.xls hdfs://localhost:9000/pig_data/
Step 5: Verifying the files
$ cd $HADOOP_HOME/bin
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/customer_data.xls
$ hdfs dfs -cat hdfs://localhost:9000/pig_data/country_data.xls

The general form of the LOAD statement is:
Relation_name = LOAD 'Input file path' USING function AS schema;

Start the Grunt shell in MapReduce mode:
$ pig -x mapreduce



Execute the LOAD statements
grunt> customer = LOAD 'hdfs://localhost:9000/pig_data/customer_data.xls'
USING PigStorage(',')
as (customer_id:int, customername:chararray, Gender:chararray, city:chararray, country_id:int);

grunt> country = LOAD 'hdfs://localhost:9000/pig_data/country_data.xls'
USING PigStorage(',')
as (country_id:int, countryname:chararray, country_region:chararray);

Store the data


STORE Relation_name INTO 'required_directory_path' [USING function];
grunt> STORE customer INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

grunt> customer_details = LOAD 'hdfs://localhost:9000/pig_data/customer_data.xls'
USING PigStorage(',')
as (customer_id:int, customername:chararray, Gender:chararray, city:chararray, country_id:int);

grunt> country_details = LOAD 'hdfs://localhost:9000/pig_data/country_data.xls'
USING PigStorage(',')
as (country_id:int, countryname:chararray, country_region:chararray);

Co-Group the data


grunt> cogroup_data = COGROUP customer_details by country_id, country_details by country_id;

Filter the Asian countries (question ii)
grunt> asian_countries = FILTER country BY country_region == 'Asia';

Join and filter the customers who belong to Asia (question iii)
grunt> customer_country = JOIN customer BY country_id, country BY country_id;
grunt> asian_customers = FILTER customer_country BY country::country_region == 'Asia';
grunt> dump asian_customers;

Get the customer names of the Asian customers (question iv)
grunt> asian_names = FOREACH asian_customers GENERATE customer::customername AS customername;

Sort in ascending order and save into a file (question v)
grunt> ordered_names = ORDER asian_names BY customername ASC;
grunt> STORE ordered_names INTO 'hdfs://localhost:9000/pig_Output/asian_customers' USING PigStorage(',');

Two Marks Questions with Answers


Q.1 What is HBase ?
Ans.: HBase is a distributed column - oriented database built on top of the Hadoop file system. It is an open-
source project and is horizontally scalable. HBase is a data model that is similar to Google's big table designed
to provide quick random access to huge amounts of structured data.
Q.2 What is Hive ?
Ans.: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries and
the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project
structure onto this data and query the data using a SQL-like language called HiveQL.
Q.3 What is Hive data definition ?



Ans.: Hive data definition assigns relational structure to the files stored on the HDFS cluster. We can easily
query the structured data to extract specific information. For example, data definition for log files would contain
columns like: CLASS, FILENAME,MESSAGE, LINENUBER, etc.
Q4 Explain services provided by Zookeeper in Hbase.
Ans.: Various services that Zookeeper provides include:
a) Establishing client communication with region servers.
b) Tracking server failure and network partitions.
c) Maintain configuration information
d) Provides ephemeral nodes, which represent different region servers.
Q.5 What is Zookeeper ?
Ans.: The ZooKeeper service keeps track of all the region servers in an HBase cluster: how many region
servers there are and which region servers are holding which DataNode.
Q.6 What are the responsibilities of HMaster ?
Ans.: Responsibilities of HMaster:
a) Manages and monitors the Hadoop cluster
b) Performs administration
c) Controlling the failover
d) DDL operations are handled by the HMaster
e) Whenever a client wants to change the schema or change any of the metadata operations, HMaster is
responsible for all these operations.
Q.7 Where to Use HBase ?
Ans.: Hadoop HBase is used to have random, real-time access to big data. It can host large tables on top of
clusters of commodity hardware. HBase is a non-relational database modeled after Google's Bigtable; it works
similarly to Bigtable to store the files of Hadoop.
Q.8 Explain unique features of Hbase ?
Ans.:
 HBase is built for low latency operations
 HBase is used extensively for random read and write operations
 HBase stores a large amount of data in terms of tables
 Automatic and configurable sharding of tables
 HBase stores data in the form of key/value pairs in a columnar model



Q.9 Explain data model in Hbase ?
Ans. The data model in HBase is designed to accommodate semi-structured data that could vary in field size,
data type and columns. Additionally, the layout of the data model makes it easier to partition the data and
distribute it across the cluster.
Q.10 What is the difference between Pig Latin and Pig engine ?
Ans.: Pig Latin is a scripting language, similar to Perl, used to search large data sets. It is composed of a
sequence of transformations and operations that are applied to the input data to produce output data.
The Pig engine is the environment in which Pig Latin programs are executed. It translates Pig Latin operators
into MapReduce jobs.
Q.11 What is pig storage ?
Ans.: Pig has a built-in load function called PigStorage. Whenever we wish to import data from a file
system into Pig, we can use PigStorage.
Q.12 What are the features of Hive ?
Ans. :
 It stores schema in a database and processes data into HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable and extensible.
Q.13 Explain the primary purpose of HiveQL queries in Hive ecosystem/ Write short note on HiveQL
queries. NOV/ DEC 2023, APR/MAY 2024
Ans. The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data
in a Metastore. Hive Query Language is used for processing and analyzing structured data. It separates users
from the complexity of Map Reduce programming.
SELECT... FROM Clauses
SELECT is the projection operator in SQL. The FROM clause identifies from which table, view or nested query
we select records.
Q.14.Mention the data types in Hive. APR/MAY 2024
Ans. Hive data types can be classified into two categories: Primary data types and Complex data types.
 Primary data types are of four types: Numeric, string, date/time and miscellaneous types
 Numeric data types: Integral types are TINYINT, SMALLINT, INT and BIGINT. Floating types are
FLOAT, DOUBLE and DECIMAL.
 String data types are string, varchar and char.



 Date/Time data types: Hive provides DATE and TIMESTAMP data types in traditional UNIX time
stamp format for date/time related fields in hive. DATE values are represented in the form YYYY-MM-
DD. TIMESTAMP use the format yyyy-mm-ddhh:mm:ss[.f...].
 Miscellaneous types: Hive supports two more primitive data types: BOOLEAN and BINARY. Hive
stores true or false values only.
 Complex type is Array, Map, Struct and Union.
Q.15. Difference between Hbase and Relational Database NOV/DEC 2023
HBase | Relational Database
HBase is schema-less. | A relational database is based on a fixed schema.
It is a column-oriented datastore. | It is a row-oriented datastore.
It is designed to store denormalized data. | It is designed to store normalized data.
It contains wide and sparsely populated tables. | It contains thin tables.
HBase supports automatic partitioning. | A relational database has no built-in support for partitioning.
It is good for semi-structured as well as structured data. | It is good for structured data.
No transactions are there in HBase. | An RDBMS is transactional.

UNIT-V
QUESTION BANK
1. What is Hbase? Draw architecture of Hbase. Explain difference between HDFS and Hbase.
2. Examine Hbase’s real world uses and benefits as a scalable and versatile NoSQL database.
Nov/Dec-2023.
3. Explain in details about data model and implementation of Hbase.
4. Briefly explain about Hbase clients with examples.
5. Write short note on Praxis.
6. What is Pig? Explain feature of Pig. Draw architecture of pig.
7. Draw and explain architecture of Hive.
8. Explain in details about data types and file formats of Hive.
9. Narrate the salient points on data manipulation in Hive using HiveQL. Nov/Dec-2023.

