
MODULE V BIG DATA FRAMEWORKS 9

Introduction to NoSQL – Aggregate Data Models – Hbase: Data Model and


Implementations – Hbase Clients – Examples – Cassandra: Data Model – Examples –
Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin –
developing and testing Pig Latin scripts. Hive – Data Types and File Formats – HiveQL Data
Definition – HiveQL Data Manipulation – HiveQL Queries.

Introduction to NoSQL
Aggregate Data Models:
The term aggregate means a collection of objects that we treat as a single unit. An
aggregate is a collection of related data that we interact with as a unit. These units of data, or
aggregates, form the boundaries for ACID operations.

The diagram referred to here shows two aggregates, Customer and Orders, with a link between them:

 The diamonds show how the data fits into the aggregate structure.
 Customer contains a list of billing addresses.
 Payment also contains the billing address.
 The address appears three times and is copied each time.
 This fits a domain where we do not want the shipping and billing addresses recorded
on existing orders to change when a customer later updates an address.
Consequences of Aggregate Orientation:
 Aggregation is not a logical data property; it is all about how the data is used by
applications.
 An aggregate structure may help with some data interactions while being an obstacle
for others.
 It has an important consequence for transactions.
 NoSQL databases generally don't support ACID transactions that span multiple
aggregates, thus sacrificing some consistency.
 Aggregate-oriented databases support the atomic manipulation of a single
aggregate at a time.
Advantages:
 It can be used as a primary data source for online applications.
 Easy replication.
 No single point of failure.
 It provides fast performance and horizontal scalability.
 It can handle structured, semi-structured, and unstructured data with equal
effort.
Disadvantages:
 No standard rules.
 Limited query capabilities.
 Doesn't work well with relational data.
 Not so popular in the enterprise.
 As the amount of data increases, it is difficult to maintain unique values.

Aggregate-oriented databases are NoSQL databases that do not support ACID transactions
spanning multiple aggregates, sacrificing some of the ACID guarantees. Aggregate-oriented
operations differ from relational database operations, and we can perform OLAP-style analysis
on an aggregate-oriented database. The efficiency of an aggregate-oriented database is
high when data transactions and interactions take place within the same aggregate. Several
fields of data can be put in an aggregate so that they are commonly accessed
together. We can manipulate only a single aggregate at a time in an atomic way; we cannot
manipulate multiple aggregates atomically.
Aggregate – Oriented databases are classified into four major data models. They are as
follows:
 Key-value
 Document
 Column family
 Graph-based
Each of the Data models above has its own query language.
 Key-value Data Model: Key-value and document databases are strongly
aggregate-oriented. The key-value data model uses a key or ID to look up the
aggregate. The aggregate is opaque to the database: just a big blob of bits that only
the application can interpret, so we can place data of any structure and data type in
it. The advantage of the key-value data model is that we can store information of any
shape, including sensitive information, in the aggregate. The disadvantage is that
many key-value stores impose general size limits, so only a limited amount of data
can be stored per key. (A small sketch after this list illustrates the aggregate idea.)
 Document Data Model: In the Document Data Model we can access parts of the
aggregate. The data in this model can be accessed in a flexible manner: we can
submit queries to the database based on fields inside the aggregate. In exchange,
there are some restrictions on the structure and data types of the data placed in this
model, because the structure of the aggregate is visible to the database.
 Column family Data Model: The column-family model is also called a two-level
map. However we think about the structure, it has been a model that influenced
later databases such as HBase and Cassandra. Databases with this Bigtable-style
data model are often referred to as column stores. Column-family models divide
the aggregate into column families, giving a two-level aggregate structure: the
first level consists of keys that act as row identifiers selecting the aggregate, and
the second-level values are referred to as columns.
 Graph Data Model: In a graph data model, the data is stored in nodes that are
connected by edges. This model is preferred for storing complex, highly
interconnected data with many relationships between records. For example, we
can store Facebook user accounts in the nodes and find the friends of a particular
user by following the edges of the graph.
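
To make the idea of an aggregate concrete, here is a minimal, illustrative Java sketch. It is not
tied to any particular database; the customer/order shape and the key "customer:1" are invented
for illustration. A key-value store would treat the whole nested structure as one opaque value
looked up by key, while a document store could additionally query inside it.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AggregateSketch {
    public static void main(String[] args) {
        // One customer aggregate: the order and the copied address travel with it as a single unit.
        Map<String, Object> order = new HashMap<>();
        order.put("id", 99);
        order.put("product", "NoSQL Distilled");
        order.put("shippingAddress", Map.of("city", "Chicago")); // address copied into the order

        Map<String, Object> customer = new HashMap<>();
        customer.put("id", 1);
        customer.put("name", "Martin");
        customer.put("billingAddress", List.of(Map.of("city", "Chicago")));
        customer.put("orders", List.of(order));

        // A key-value store sees only: key -> opaque blob of bits.
        String key = "customer:1";
        byte[] opaqueValue = customer.toString().getBytes(); // stand-in for a serialized aggregate
        System.out.println(key + " -> " + opaqueValue.length + " bytes (opaque to the store)");
        // A document store, by contrast, could index and query fields such as orders[0].product.
    }
}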
HBase
HBase Data Model
The Data Model in HBase is designed to accommodate semi-structured data that could vary
in field size, data type and columns. Additionally, the layout of the data model makes it easier
to partition the data and distribute it across the cluster. The Data Model in HBase is made of
different logical components such as Tables, Rows, Column Families, Columns, Cells and
Versions.

Tables – An HBase table is a logical collection of rows stored in separate
partitions called Regions. Every Region is served by exactly one Region Server.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are
unique in a Table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column
Family has one or more Columns, and the Columns in a family are stored together in a low-
level storage file known as an HFile. Column Families form the basic unit of physical storage
to which certain HBase features, such as compression, are applied. Hence it's important that
proper care be taken when designing the Column Families in a table.
For example, a table might have Customer and Sales Column Families: the Customer Column
Family made up of two columns (Name and City), and the Sales Column Family made up of two
columns (Product and Amount).

Columns – A Column Family is made up of one or more Columns. A Column is identified by a
Column Qualifier, which consists of the Column Family name concatenated with the Column
name using a colon, for example: columnfamily:columnname. There can be multiple Columns
within a Column Family, and rows within a table can have a varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family
and the Column (Column Qualifier). The data stored in a Cell is called its value and the data
type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the
timestamp. The number of versions of data retained in a column family is configurable and
this value by default is 3.
HBase Clients
HTable is the HBase client class that represents an HBase table. To
communicate with a single HBase table, we use this implementation of a table. It belongs to
the org.apache.hadoop.hbase.client package.
a. Constructors
i. HTable()
ii. HTable(TableName tableName, ClusterConnection connection, ExecutorService
pool)
We can create an object to access an HBase table, by using this constructor.
b. Methods
i. void close()
Basically, to release all the resources of the HTable, we use this method.
ii. void delete(Delete delete)
The method “void delete(Delete delete)” helps to delete the specified cells/row.
iii. boolean exists(Get get)
As specified by Get, it is possible to test the existence of columns in the table, with this
method.
iv. Result get(Get get)
This method retrieves certain cells from a given row.
v. org.apache.hadoop.conf.Configuration getConfiguration()
It returns the Configuration object used by this instance.
vi. TableName getName()
This method returns the table name instance of this table.
vii. HTableDescriptor getTableDescriptor()
It returns the table descriptor for this table.

viii. byte[] getTableName()


This method returns the name of this table.
ix. void put(Put put)
We can insert data into the table, by using this method.
Class Put in HBase Client API
In order to perform put operations for a single row, we use this class. This class belongs to
the org.apache.hadoop.hbase.client package.
a. Constructors
i. Put(byte[] row)
We can create a Put operation for the specified row, by using this constructor.
ii. Put(byte[] rowArray, int rowOffset, int rowLength)
This constructor copies the passed-in row key (the given region of the array) to keep locally.
iii. Put(byte[] rowArray, int rowOffset, int rowLength, long ts)
This constructor copies the passed-in row key to keep locally and uses the given timestamp.
iv. Put(byte[] row, long ts)
This constructor creates a Put operation for the specified row, using the given timestamp.
b. Methods
i. Put add(byte[] family, byte[] qualifier, byte[] value)
This method adds the specified column and value to this Put operation.
ii. Put add(byte[] family, byte[] qualifier, long ts, byte[] value)
This method adds the specified column and value, with the specified timestamp as its version,
to this Put operation.
iii. Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value)
This ByteBuffer variant likewise adds the specified column and value, with the specified
timestamp as its version, to this Put operation.
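
Putting the HTable and Put classes together, a minimal Java sketch of an insert might look like
the following. It assumes the older HTable-based client API described in this section, including
the deprecated HTable(Configuration, String) constructor; the table name "employee", the column
family "personal", and the qualifier/value are hypothetical, and the table is assumed to exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        HTable table = new HTable(conf, "employee");

        Put put = new Put(Bytes.toBytes("row1"));           // Put operation for the row key "row1"
        put.add(Bytes.toBytes("personal"),                  // column family
                Bytes.toBytes("name"),                      // column qualifier
                Bytes.toBytes("Raja"));                     // value

        table.put(put);                                     // void put(Put put)
        table.close();                                      // release the resources of the HTable
    }
}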
Class Get in HBase Client API
To perform Get operations on a single row, we use this class. It belongs to the
org.apache.hadoop.hbase.client package.
a. Constructor
i. Get(byte[] row)
It is possible to create a Get operation for the specified row by using this constructor.
ii. Get(Get get)
This copy constructor creates a Get operation by copying an existing Get instance.
b. Methods
i. Get addColumn(byte[] family, byte[] qualifier)
This method tells the Get to retrieve the column from the specified family with the specified
qualifier.
ii. Get addFamily(byte[] family)
This method tells the Get to retrieve all columns from the specified family.
Class Delete in HBase Client API
In order to perform delete operations on a single row, we use this Class. Instantiate a Delete
object with the row to delete, to delete an entire row. It belongs to the
org.apache.hadoop.hbase.client package.
a. Constructor
i. Delete(byte[] row)
To create a delete operation for the specified row, we use it.
ii. Delete(byte[] rowArray, int rowOffset, int rowLength)
This constructor creates a Delete operation for the row specified by the given region of the array.
iii. Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)
This constructor creates a Delete operation for the row specified by the given region of the
array, using the given timestamp.
iv. Delete(byte[] row, long timestamp)
This constructor creates a Delete operation for the specified row, using the given timestamp.
b. Methods
i. Delete addColumn(byte[] family, byte[] qualifier)
This method helps to delete the latest version of the specified column.
ii. Delete addColumns(byte[] family, byte[] qualifier, long timestamp)
This method deletes all versions of the specified column with a timestamp less than or equal
to the specified timestamp.
iii. Delete addFamily(byte[] family)
The method “Delete addFamily(byte[] family)” deletes all versions of all columns of the
specified family.
iv. Delete addFamily(byte[] family, long timestamp)
Again, with a timestamp less than or equal to the specified timestamp, this method also
deletes all columns of the specified family.
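
A corresponding sketch for the Delete class, under the same assumptions (older HTable API; the
table "employee" and the column families "personal" and "professional" are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee");

        Delete delete = new Delete(Bytes.toBytes("row1"));                   // Delete for row "row1"
        delete.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));  // latest version of one column
        delete.addFamily(Bytes.toBytes("professional"));                     // all columns of a whole family

        table.delete(delete);                                                // void delete(Delete delete)
        table.close();
    }
}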
Class Result in HBase Client API
In order to get a single row result of a Get or a Scan query, we use class result HBase Client
API.
a. Constructors
i. Result()
This constructor creates an empty Result with no KeyValue payload; rawCells() returns null
on such a Result.
b. Methods
i. byte[] getValue(byte[] family, byte[] qualifier)
Basically, in order to get the latest version of the specified column, we use this method.
ii. byte[] getRow()
Moreover, to retrieve the row key which corresponds to the row from which this Result was
created, we use this method.
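
Finally, a read that ties the Get and Result classes together, under the same assumptions as the
earlier sketches:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee");

        Get get = new Get(Bytes.toBytes("row1"));                          // Get for row "row1"
        get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));   // restrict to one column

        Result result = table.get(get);                                    // Result get(Get get)
        byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
        System.out.println("row key: " + Bytes.toString(result.getRow()));
        System.out.println("personal:name = " + Bytes.toString(value));

        table.close();
    }
}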
Cassandra
Apache Cassandra is a highly scalable, high-performance distributed database designed to
handle large amounts of data across many commodity servers, providing high availability with
no single point of failure. It is a type of NoSQL database. Let us first understand what a NoSQL
database does.
What is Apache Cassandra?
Apache Cassandra is an open-source, distributed, decentralized storage system
(database) for managing very large amounts of structured data spread out across the world. It
provides a highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra −
 It is scalable, fault-tolerant, and consistent.
 It is a column-oriented database.
 Its distribution design is based on Amazon’s Dynamo and its data model on
Google’s Bigtable.
 Created at Facebook, it differs sharply from relational database management
systems.
 Cassandra implements a Dynamo-style replication model with no single point
of failure, but adds a more powerful “column family” data model.
 Cassandra is being used by some of the biggest companies, such as Facebook,
Twitter, Cisco, Rackspace, eBay, Netflix, and more.
Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given below
are some of the features of Cassandra:
 Elastic scalability − Cassandra is highly scalable; it allows you to add more
hardware to accommodate more customers and more data as per requirement.
 Always on architecture − Cassandra has no single point of failure and it is
continuously available for business-critical applications that cannot afford a
failure.
 Fast linear-scale performance − Cassandra is linearly scalable, i.e., it increases
your throughput as you increase the number of nodes in the cluster. Therefore it
maintains a quick response time.
 Flexible data storage − Cassandra accommodates all possible data formats
including: structured, semi-structured, and unstructured. It can dynamically
accommodate changes to your data structures according to your need.
 Easy data distribution − Cassandra provides the flexibility to distribute data
where you need by replicating data across multiple data centers.
 Transaction support − Cassandra offers ACID-like guarantees: atomicity and
isolation at the row level, durability through the commit log, and tunable
consistency, though not full multi-row ACID transactions.
 Fast writes − Cassandra was designed to run on cheap commodity hardware. It
performs blazingly fast writes and can store hundreds of terabytes of data,
without sacrificing the read efficiency.
History of Cassandra
 Cassandra was developed at Facebook for inbox search.
 It was open-sourced by Facebook in July 2008.
 Cassandra was accepted into Apache Incubator in March 2009.
 It became an Apache top-level project in February 2010.
The design goal of Cassandra is to handle big data workloads across multiple nodes without
any single point of failure. Cassandra has a peer-to-peer distributed architecture, and
data is distributed among all the nodes in a cluster.
 All the nodes in a cluster play the same role. Each node is independent and at
the same time interconnected to other nodes.
 Each node in a cluster can accept read and write requests, regardless of where
the data is actually located in the cluster.
 When a node goes down, read/write requests can be served from other nodes in
the network.
Data Replication in Cassandra
In Cassandra, one or more of the nodes in a cluster act as replicas for a given piece of data. If
it is detected that some of the nodes responded with an out-of-date value, Cassandra will return
the most recent value to the client. After returning the most recent value, Cassandra performs
a read repair in the background to update the stale values.
The following figure shows a schematic view of how Cassandra uses data replication among
the nodes in a cluster to ensure no single point of failure.
Note − Cassandra uses the Gossip Protocol in the background to allow the nodes to
communicate with each other and detect any faulty nodes in the cluster.
Components of Cassandra
The key components of Cassandra are as follows −
 Node − It is the place where data is stored.
 Data center − It is a collection of related nodes.
 Cluster − A cluster is a component that contains one or more data centers.
 Commit log − The commit log is a crash-recovery mechanism in Cassandra.
Every write operation is written to the commit log.
 Mem-table − A mem-table is a memory-resident data structure. After the commit
log, the data is written to the mem-table. Sometimes, for a single column
family, there will be multiple mem-tables.
 SSTable − It is a disk file to which the data is flushed from the mem-table when
its contents reach a threshold value.
 Bloom filter − A Bloom filter is a quick, probabilistic structure for testing
whether an element is a member of a set. It is a special kind of cache: Bloom
filters are consulted on reads to skip SSTables that cannot contain the requested key.
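
The Bloom filter idea can be illustrated with a small, self-contained Java sketch. This is not
Cassandra's implementation (Cassandra uses far better hash functions and sizing); it only shows
why a Bloom filter can answer "definitely not present" but never more than "possibly present".

import java.util.BitSet;

// Toy Bloom filter: k hash positions per key over a fixed-size bit array.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position for a key (illustrative hashing only).
    private int position(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9;
        return (h & 0x7fffffff) % size;
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(position(key, i));
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) return false;   // definitely not in the set
        }
        return true;                                         // possibly in the set (false positives allowed)
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        filter.add("row-42");
        System.out.println(filter.mightContain("row-42"));   // true
        System.out.println(filter.mightContain("row-99"));   // almost certainly false
    }
}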

A Cassandra column family has the following attributes −


 keys_cached − It represents the number of locations to keep cached per
SSTable.
 rows_cached − It represents the number of rows whose entire contents will be
cached in memory.
 preload_row_cache − It specifies whether you want to pre-populate the row
cache.
(The figure in the original notes shows an example of a Cassandra column family with these attributes.)
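
The syllabus also lists Cassandra clients. Although these notes do not fix a particular client
library, a minimal sketch using one widely used option, the DataStax Java driver (3.x API), might
look like the following. The contact point, the keyspace "demo", and the table "users" are
hypothetical and are assumed to already exist.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraClientExample {
    public static void main(String[] args) {
        // Connect to a single contact point; the driver discovers the rest of the cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {

            session.execute("INSERT INTO users (id, name) VALUES (1, 'Raja')");

            ResultSet rs = session.execute("SELECT id, name FROM users");
            for (Row row : rs) {
                System.out.println(row.getInt("id") + " -> " + row.getString("name"));
            }
        } // Cluster and Session are closed automatically (both are Closeable in driver 3.x)
    }
}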
Hadoop Integration
Big Data Management can connect to clusters that run different Hadoop distributions. Hadoop is an open-
source software framework that enables distributed processing of large data sets across clusters of machines.
You might also need to use third-party software clients to set up and manage your Hadoop cluster.
Big Data Management can connect to the supported data source in the Hadoop environment, such as HDFS,
HBase, or Hive, and push job processing to the Hadoop cluster. To enable high performance access to files
across the cluster, you can connect to an HDFS source. You can also connect to a Hive source, which is a
data warehouse that connects to HDFS.
It can also connect to NoSQL databases such as HBase, which is a database comprising key-value pairs on
Hadoop that performs operations in real-time. The Data Integration Service can push mapping jobs to the
Spark or Blaze engine, and it can push profile jobs to the Blaze engine in the Hadoop environment.
Big Data Management supports more than one version of some Hadoop distributions. By default, the cluster
configuration wizard populates the latest supported version.

The Data Integration Service automatically installs the Hadoop binaries to integrate the Informatica domain
with the Hadoop environment. The integration requires Informatica connection objects and cluster
configurations. A cluster configuration is a domain object that contains configuration parameters that you
import from the Hadoop cluster. You then associate the cluster configuration with connections to access the
Hadoop environment.
Perform the following tasks to integrate the Informatica domain with the Hadoop environment:
1. Install or upgrade to the current Informatica version.
2. Perform pre-import tasks, such as verifying system requirements and user permissions.
3. Import the cluster configuration into the domain. The cluster configuration contains properties from the
*-site.xml files on the cluster.
4. Create a Hadoop connection and other connections to run mappings within the Hadoop environment.
5. Perform post-import tasks specific to the Hadoop distribution that you integrate with.
When you run a mapping, the Data Integration Service checks for the binary files on the cluster. If they do
not exist or if they are not synchronized, the Data Integration Service prepares the files for transfer. It
transfers the files to the distributed cache through the Informatica Hadoop staging directory on HDFS. By
default, the staging directory is /tmp. This transfer process replaces the requirement to install distribution
packages on the Hadoop cluster.

Approaches used for Hadoop integration include:


 Hadoop integration using a standard Guardium S-TAP
Learn how to integrate Hadoop using a standard Guardium S-TAP for HDFS and
MapReduce monitoring.
 Hadoop integration using Cloudera Navigator
Learn how to integrate Hadoop using Cloudera Navigator, Cloudera's native data
governance solution.
 Hadoop integration using Hortonworks and Apache Ranger
Apache Ranger, included with the Hortonworks Data Platform, offers fine-grained access
control and auditing over Hadoop components such as Hive, HBase, and HDFS by using
policies.

PIG

Introduction to PIG :

 Pig is a high-level platform or tool which is used to process large datasets.


 It provides a high level of abstraction for processing over MapReduce.
 It provides a high-level scripting language, known as Pig Latin which is used to
develop the data analysis codes.
 Pig Latin and Pig Engine are the two main components of the Apache Pig tool.
 The result of Pig is always stored in the HDFS.
 One limitation of MapReduce is that the development cycle is very long: writing the
mapper and reducer, compiling and packaging the code, submitting the job, and retrieving
the output is a time-consuming process.
 Apache Pig reduces development time by using the multi-query approach.
 Pig is beneficial for programmers who are not from a Java background.
 Roughly 200 lines of Java code can often be expressed in about 10 lines of Pig Latin.
 Programmers who have SQL knowledge need less effort to learn Pig Latin.
 Execution Modes of Pig :
 Apache Pig scripts can be executed in three ways :
 Interactive Mode (Grunt shell) :
 You can run Apache Pig in interactive mode using the Grunt shell.
 In this shell, you can enter the Pig Latin statements and get the output (using the Dump
operator).

 Batch Mode (Script) :


 You can run Apache Pig in Batch mode by writing the Pig Latin script in a single file
with the .pig extension.
 Embedded Mode (UDF) :
 Apache Pig provides the provision of defining our own functions
(User Defined Functions) in programming languages such as Java and using them in
our script.
Apache Pig comes with the following features −
 Rich set of operators − It provides many operators to perform operations like join,
sort, filter, etc.
 Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.
 Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.
 Extensibility − Using the existing operators, users can develop their own functions to
read, process, and write data.
 UDF’s − Pig provides the facility to create User-defined Functions in other
programming languages such as Java and invoke or embed them in Pig Scripts.
 Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.

Grunt :
Grunt is Pig's interactive shell. It enables users to enter Pig Latin statements interactively and
provides a shell for users to interact with HDFS (or, in local mode, the local file system).
 The Grunt shell is started by running the pig command without specifying a script.
 The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
 Pig scripts can also be executed from the Grunt shell, which is the native shell provided
by Apache Pig to execute Pig queries.
 We can invoke shell commands using sh and fs.
 Syntax of sh command :
grunt> sh ls

 Syntax of fs command :

grunt>fs -ls

A separate tool that shares the name (Grunt, the JavaScript task runner) is unrelated to Pig's
Grunt shell. It helps automate mundane and repetitive tasks such as minification, compilation,
unit testing, and linting. Grunt has hundreds of plugins to choose from, and you can use it to
automate just about anything with a minimum of effort. The objective of this part is to get
started with Grunt and learn how to automatically minify JavaScript files and validate them
using JSHint.
Installing Grunt-CLI: First, you need to install Grunt’s command-line interface (CLI)
globally so we can use it from everywhere.
$ npm install -g grunt-cli
Creating a new Grunt Project: You will need to create a new project or you can use an
existing project.
Let’s call it grunt_app.
Now you will need to add two files to your project: package.json and the Gruntfile.
package.json: It stores the various devDependencies and dependencies for your project as
well as some metadata. You will list grunt and the Grunt plugins your project needs as
devDependencies in this file.
Gruntfile: This is a configuration file for grunt. It can be named
as Gruntfile.js or Gruntfile.coffee.
Run the following commands from the root directory of your project:
// Generate a package.json file
$ npm init

// Install grunt and add in package.json


$ npm install grunt --save-dev
Now create a file in your directory called Gruntfile.js and copy the following into it.
module.exports = function(grunt) {
// Do grunt-related things in here
};
This is the “wrapper” function and all of the Grunt code must be specified inside it. It includes
the project configuration and task configuration.
Now create two more files: index.html and main.js
index.html

<html>

<body>

<h1>Hello World</h1>

<script src="main.min.js"></script>

</body>

</html>

main.js
function greet() {
    alert("Hello GeeksForGeeks");
}

We will use a grunt-contrib-uglify plugin to minify JavaScript files with UglifyJS.


Install grunt-contrib-uglify:
$ npm install grunt-contrib-uglify --save-dev
Then configure the uglify task in the Gruntfile as follows:
module.exports = function(grunt) {
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json'),
    uglify: {
      build: {
        src: 'main.js',
        dest: 'main.min.js'
      }
    }
  });
  grunt.loadNpmTasks('grunt-contrib-uglify');
};

Now you can run $ grunt uglify to minify your file. You can also set default tasks for grunt
which run whenever $ grunt is run.
To validate our JavaScript files we will use grunt-contrib-jshint. Install the plugin using
$ npm install grunt-contrib-jshint --save-dev. You can use it by running $ grunt jshint.
module.exports = function(grunt) {
  grunt.initConfig({
    pkg: grunt.file.readJSON('package.json'),
    uglify: {
      build: {
        src: 'main.js',
        dest: 'main.min.js'
      }
    },
    jshint: {
      options: {
        curly: true,
        eqeqeq: true,
        eqnull: true,
        browser: true,
        globals: {
          jQuery: true
        },
      },
      uses_defaults: ['*.js']
    },
  });
  grunt.loadNpmTasks('grunt-contrib-uglify');
  grunt.loadNpmTasks('grunt-contrib-jshint');

  // Default task(s).
  grunt.registerTask('default', ['uglify']);
};

PIG Data Model


Pig is a tool/platform used to analyze large data sets by representing them as data
flows. Pig is generally used with Hadoop; we can perform all the data manipulation
operations in Hadoop using Apache Pig.
The data model of Pig Latin is fully nested and allows complex non-atomic data types such
as map and tuple. The elements of Pig Latin's data model are described below.
Atom
Any single value in Pig Latin, irrespective of its data type, is known as an Atom. It is stored
as a string and can be used as a string or a number. int, long, float, double, chararray, and
bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.
Example − ‘raja’ or ‘30’

Tuple
A record that is formed by an ordered set of fields is known as a tuple, the fields can be of any
type. A tuple is similar to a row in a table of RDBMS.
Example − (Raja, 30)

Bag
A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known
as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by
‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that
every tuple contain the same number of fields or that the fields in the same position (column)
have the same type.
Example − {(Raja, 30), (Mohammad, 45)}
A bag can be a field in a relation; in that context, it is known as an inner bag.
Example − (Raja, 30, {(9848022338, raja@gmail.com)})

Map
A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and
should be unique. The value might be of any type. It is represented by ‘[]’
Example − [name#Raja, age#30]

Relation
A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that
tuples are processed in any particular order).
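
Inside Java UDFs (covered in the next sections), these Pig data types appear as concrete Java
classes in the org.apache.pig.data package. The following minimal sketch rebuilds the tuple, bag,
and map examples above with that API; it assumes only that the Pig library is on the classpath.

import java.util.HashMap;
import java.util.Map;

import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigDataModelExample {
    public static void main(String[] args) {
        TupleFactory tupleFactory = TupleFactory.getInstance();
        BagFactory bagFactory = BagFactory.getInstance();

        // Tuple: an ordered set of fields, e.g. (Raja, 30)
        Tuple raja = tupleFactory.newTuple();
        raja.append("Raja");        // atom of type chararray
        raja.append(30);            // atom of type int

        Tuple mohammad = tupleFactory.newTuple();
        mohammad.append("Mohammad");
        mohammad.append(45);

        // Bag: an unordered collection of tuples, e.g. {(Raja,30), (Mohammad,45)}
        DataBag bag = bagFactory.newDefaultBag();
        bag.add(raja);
        bag.add(mohammad);

        // Map: chararray keys to values of any type, e.g. [name#Raja, age#30]
        Map<String, Object> map = new HashMap<>();
        map.put("name", "Raja");
        map.put("age", 30);

        System.out.println(bag);    // printed in Pig's {(...),(...)} notation
        System.out.println(map);
    }
}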
PIG Latin
The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high-level
data processing language which provides a rich set of data types and operators to perform
various operations on the data.
To perform a particular task, programmers using Pig need to write a Pig script
using the Pig Latin language and execute it using any of the execution mechanisms (Grunt
shell, UDFs, embedded mode). After execution, these scripts go through a series of
transformations applied by the Pig framework to produce the desired output.
Internally, Apache Pig converts these scripts into a series of MapReduce jobs, and thus, it
makes the programmer’s job easy. The architecture of Apache Pig is shown below.

Apache Pig Components


As shown in the figure, there are various components in the Apache Pig framework. Let us take
a look at the major components.
Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser will be a DAG (directed
acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally the MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these
MapReduce jobs are executed on Hadoop producing the desired results.

Pig Latin is the data flow language used by Apache Pig to analyze data in Hadoop.
It is a textual language that abstracts the programming from the Java MapReduce idiom into a
higher-level notation.
Pig Latin statements are used to process the data. A statement is an operator that accepts a
relation as input and generates another relation as output.
· A statement can span multiple lines.
· Each statement must end with a semi-colon.
· A statement may include expressions and schemas.
· By default, these statements are processed using multi-query execution.

User-Defined Functions :
Apache Pig provides extensive support for User Defined Functions(UDF’s).
Using these UDF’s, we can define our own functions and use them.
The UDF support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDF’s, complete support is provided in Java and limited support is provided in all
the remaining languages.
Using Java, you can write UDF’s involving all parts of the processing like data load/store,
column transformation, and aggregation.
Since Apache Pig has been written in Java, the UDF’s written using Java language work
efficiently compared to other languages.
Types of UDF’s in Java :
Filter Functions :

 The filter functions are used as conditions in filter statements.


 These functions accept a Pig value as input and return a Boolean value.
Eval Functions :

 The Eval functions are used in FOREACH-GENERATE statements.


 These functions accept a Pig value as input and return a Pig result.
Algebraic Functions :

 The Algebraic functions act on inner bags in a FOREACH...GENERATE statement.


 These functions are used to perform full MapReduce operations on an inner bag.
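
A minimal Eval function sketch in Java is shown below. The class name UpperCase and the
assumption that the first field is a chararray are illustrative choices; only the EvalFunc base
class and the Tuple type come from Pig itself. After packaging it into a JAR, it would be
registered in a script with REGISTER myudfs.jar; and invoked like a built-in function, for
example B = FOREACH A GENERATE UpperCase(name);

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF: takes one chararray field and returns it upper-cased.
public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;                       // Pig convention: null in, null out
        }
        return input.get(0).toString().toUpperCase();
    }
}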
Developing and Testing Pig Latin Scripts
Comments in Pig Script
While writing a script in a file, we can include comments in it as shown below.
Multi-line comments
We will begin the multi-line comments with '/*', end them with '*/'.
/* These are the multi-line comments
In the pig script */
Single –line comments
We will begin the single-line comments with '--'.
--we can write single line comments like this.
Executing Pig Script in Batch mode
While executing Apache Pig statements in batch mode, follow the steps given below.
Step 1
Write all the required Pig Latin statements in a single file. We can write all the Pig Latin
statements and commands in a single file and save it as .pig file.
Step 2
Execute the Apache Pig script. You can execute the Pig script from the shell (Linux) as shown
below.

Local mode: $ pig -x local Sample_script.pig
MapReduce mode: $ pig -x mapreduce Sample_script.pig


You can execute it from the Grunt shell as well using the exec command as shown below.
grunt> exec /sample_script.pig
Executing a Pig Script from HDFS
We can also execute a Pig script that resides in the HDFS. Suppose there is a Pig script with
the name Sample_script.pig in the HDFS directory named /pig_data/. We can execute it as
shown below.
$ pig -x mapreduce hdfs://localhost:9000/pig_data/Sample_script.pig
Example
Assume we have a file student_details.txt in HDFS with the following content.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
We also have a sample script with the name sample_script.pig, in the same HDFS directory.
This file contains statements performing operations and transformations on
the student relation, as shown below.
student = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

student_order = ORDER student BY age DESC;

student_limit = LIMIT student_order 4;

Dump student_limit;
 The first statement of the script will load the data in the file
named student_details.txt as a relation named student.
 The second statement of the script will arrange the tuples of the relation in descending
order, based on age, and store it as student_order.
 The third statement of the script will store the first 4 tuples
of student_order as student_limit.
 Finally the fourth statement will dump the content of the relation student_limit.

Let us now execute the sample_script.pig as shown below.


$./pig -x mapreduce hdfs://localhost:9000/pig_data/sample_script.pig
Apache Pig gets executed and gives you the output with the following content.
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,23,9848022335,Chennai)
2015-10-19 10:31:27,446 [main] INFO org.apache.pig.Main - Pig script completed in 12
minutes, 32 seconds and 751 milliseconds (752751 ms)

Development Tools

Pig provides several tools and diagnostic operators to help you develop your applications. In
this section we will explore these and also look at some tools others have written to make it
easier to develop Pig with standard editors and integrated development environments (IDEs).

Syntax Highlighting and Checking

Syntax highlighting often helps users write code correctly, at least syntactically, the first time
around. Syntax highlighting packages exist for several popular editors. They were created and
added at various times, so how well their highlighting conforms to current Pig Latin syntax varies.

Table 7-1. Pig Latin syntax highlighting packages

Tool URL

Eclipse http://code.google.com/p/pig-eclipse

Emacs http://github.com/cloudera/piglatin-mode, http://sf.net/projects/pig-mode

TextMate http://www.github.com/kevinweil/pig.tmbundle

Vim http://www.vim.org/scripts/script.php?script_id=2186

In addition to these syntax highlighting packages, Pig will also let you check the syntax of your
script without running it. If you add -c or -check to the command line, Pig will just parse and
run semantic checks on your script. The -dryrun command-line option will also check your
syntax, expand any macros and imports, and perform parameter substitution.

Testing Modes
You can run Apache Pig in two modes, namely, Local Mode and HDFS mode.
Local Mode
In this mode, all the files are installed and run from your local host and local file system. There
is no need of Hadoop or HDFS. This mode is generally used for testing purpose.
MapReduce Mode
MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.
Apache Pig Execution Mechanisms
Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.
 Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using
the Grunt shell. In this shell, you can enter the Pig Latin statements and get the output
(using Dump operator).
 Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.
 Embedded Mode (UDF) − Apache Pig provides the provision of defining our own
functions (User Defined Functions) in programming languages such as Java, and using
them in our script.

HIVE
HIVE Datatypes
This chapter takes you through the different data types in Hive, which are involved in the table
creation. All the data types in Hive are classified into four types, given as follows:

 Column Types
 Literals
 Null Values
 Complex Types
Column Types
Column type are used as column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds
the range of INT, you need to use BIGINT and if the data range is smaller than the INT, you
use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:

Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L

String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive has
two string data types: VARCHAR and CHAR. Hive follows C-style escape characters.
The following table depicts the string data types:

Data Type Length

VARCHAR 1 to 65535

CHAR 255

Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used for
representing immutable arbitrary-precision numbers. The syntax and example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create
union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Floating Point Types
Floating point types are numbers with decimal points. Generally, this type of data
is represented by the DOUBLE data type.
Decimal Type
Decimal type data is a floating point value with a higher range than the DOUBLE data
type. The range of the decimal type is approximately -10^308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
A struct in Hive groups together named fields, each with its own data type and an optional comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>

Hive File Format


Hive facilitates managing large data sets and supports multiple data formats,
including comma-separated values (.csv), TextFile, RCFile, ORC, and Parquet.
Different file formats and compression codecs work better for different data sets in Apache
Hive.
Following are the Apache Hive different file formats:

 Text File
 Sequence File
 RC File
 AVRO File
 ORC File
 Parquet File

Hive Data Types

Now that you know how data is classified in Hive, let us look at the different Hive data
types. These are classified as primitive and complex data types.

Primitive Data Types:

1. Numeric Data types - Data types like integral, float, decimal

2. String Data type - Data types like char, string

3. Date/ Time Data type - Data types like timestamp, date, interval

4. Miscellaneous Data type - Data types like Boolean and binary

Complex Data Types:

1. Arrays - A collection of the same entities. The syntax is: array<data_type>

2. Maps - A collection of key-value pairs and the syntax is map<primitive_type, data_type>

3. Structs - A collection of complex data with comments. Syntax: struct<col_name : data_type


[COMMENT col_comment],…..>

4. Unions - A collection of heterogeneous data types. Syntax: uniontype<data_type, data_type,..>

Hive Text File Format

The Hive text file format is the default storage format. You can use the text format to
interchange data with other client applications. The text file format is very common in most
applications. Data is stored in lines, with each line being a record. Each line is terminated
by a newline character (\n).
The text format is a simple plain file format. You can use compression (for example, BZIP2)
on the text file to reduce the storage space.
Create a TEXT file by adding the storage option ‘STORED AS TEXTFILE’ at the end of a
Hive CREATE TABLE command.
Hive Text File Format Examples
Below is the Hive CREATE TABLE command with storage format specification:

Create table textfile_table

(column_specs)

stored as textfile;

Hive Sequence File Format

Sequence files are Hadoop flat files which store values as binary key-value pairs. The
sequence files are in binary format, and these files are splittable. A main advantage of
using sequence files is that two or more files can be merged into one file.
Create a sequence file by adding the storage option ‘STORED AS SEQUENCEFILE’ at the end
of a Hive CREATE TABLE command.
Hive Sequence File Format Example
Below is the Hive CREATE TABLE command with storage format specification:

Create table sequencefile_table

(column_specs)

stored as sequencefile;

Hive RC File Format

RCFile (Record Columnar File) is a row-columnar file format. This is another Hive file
format which offers high row-level compression rates. If you have a requirement to process
multiple rows at a time, you can use the RCFile format.
The RCFile format is very similar to the sequence file format. This file format also stores the
data as key-value pairs.

Create RCFile by specifying ‘STORED AS RCFILE’ option at the end of a CREATE


TABLE Command:
Hive RC File Format Example
Below is the Hive CREATE TABLE command with storage format specification:

Create table RCfile_table

(column_specs)

stored as rcfile;

Hive AVRO File Format

Avro is an open-source project that provides data serialization and data exchange services for
Hadoop. You can exchange data between the Hadoop ecosystem and programs written in any
programming language. Avro is one of the popular file formats in Hadoop-based big data
applications.
Create AVRO file by specifying ‘STORED AS AVRO’ option at the end of a CREATE
TABLE Command.
Hive AVRO File Format Example
Below is the Hive CREATE TABLE command with storage format specification:

Create table avro_table

(column_specs)

stored as avro;

Hive ORC File Format

ORC stands for Optimized Row Columnar file format. The ORC file format
provides a highly efficient way to store data in a Hive table. This file format was
designed to overcome limitations of the other Hive file formats. Using ORC files
improves performance when Hive is reading, writing, and processing data from large tables.
Create ORC file by specifying ‘STORED AS ORC’ option at the end of a CREATE TABLE
Command.
Hive ORC File Format Examples
Below is the Hive CREATE TABLE command with storage format specification:

Create table orc_table

(column_specs)

stored as orc;

Hive Parquet File Format

Parquet is a column-oriented binary file format. Parquet is highly efficient for
large-scale queries, and is especially good for queries scanning particular columns
within a particular table. Parquet tables use compression codecs such as Snappy and gzip;
currently Snappy is the default.
Create Parquet file by specifying ‘STORED AS PARQUET’ option at the end of a
CREATE TABLE Command.
Hive Parquet File Format Example
Below is the Hive CREATE TABLE command with storage format specification:

Create table parquet_table

(column_specs)

stored as parquet;

HiveQL Data Definition


HiveQL :
Even though it is based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multi-table inserts and CREATE TABLE AS SELECT.
HiveQL initially lacked support for transactions and materialized views, and offered only
limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with
release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of
MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
DROP TABLE IF EXISTS docs;
CREATE TABLE docs
(line STRING);

This checks whether the table docs exists and drops it if it does, then creates a new table
called docs with a single column of type STRING called line.
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;

Loads the specified file or directory (In this case “input_file”) into the table.
OVERWRITE specifies that the target table to which the data is being loaded is to be re-written;
Otherwise, the data would be appended.
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;

The query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a
table called word_counts with two columns: word and count.
This query draws its input from the inner
query (SELECT explode(split(line, '\s')) AS word FROM docs) temp.
This query serves to split the input words into different rows of a temporary table aliased
as temp.
The GROUP BY word clause groups the results based on their keys.
This results in the count column holding the number of occurrences for each word of
the word column.
The ORDER BY word clause sorts the words alphabetically.
Tables :
Here are the types of tables in Apache Hive:
Managed Tables :

In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data warehouse, which
is essentially just a file location in HDFS.
By managed or controlled we mean that if you drop (delete) a managed table, then Hive will
delete both the Schema (the description of the table) and the data files associated with the table.
Default location is /user/hive/warehouse.
The syntax for Managed Tables :
CREATE TABLE IF NOT EXISTS stocks (exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;

External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the data file(s)
there.
This location is included as part of the table definition statement.
When an external table is deleted, Hive will only delete the schema associated with the table.
The data files are not affected.
Syntax for External Tables :
CREATE EXTERNAL TABLE IF NOT EXISTS stocks
(exchange STRING,
symbol STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Querying Data :
A query is a request for data or information from a database table or a combination of tables.
This data may be generated as results returned by Structured Query Language (SQL) or as
pictorials, graphs or complex results, e.g., trend analyses from data-mining tools.
One of several different query languages may be used to perform a range of simple to complex
database queries.
SQL, the most well-known and widely used query language, is familiar to most database
administrators (DBAs).

User-Defined Functions :
In Hive, the users can define their own functions to meet certain client requirements.
These are known as UDFs in Hive.
User-defined functions are typically written in Java for specific modules.
Some of UDFs are specifically designed for the reusability of code in application frameworks.
The developer will develop these functions in Java and integrate those UDFs with the Hive.
During the Query execution, the developer can directly use the code, and UDFs will return
outputs according to the user-defined tasks.
It will provide high performance in terms of coding and execution.
The general type of UDF will accept a single input value and produce a single output value.
We can use two different interfaces for writing Apache Hive User-Defined Functions :
1. Simple API
2. Complex API
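
A minimal sketch of the simple API is shown below. The class name ToUpper and its behavior are
illustrative; the UDF base class org.apache.hadoop.hive.ql.exec.UDF and the evaluate() convention
are what the simple API expects. After building a JAR, the function would be made available with
ADD JAR and CREATE TEMPORARY FUNCTION.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple-API Hive UDF: Hive matches calls to an evaluate() method by its signature.
public class ToUpper extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;                 // propagate NULLs, as built-in functions do
        }
        return new Text(input.toString().toUpperCase());
    }
}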
Sorting And Aggregating :

Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but there is a
catch.
ORDER BY produces a result that is totally sorted, as expected, but to do so it sets the number
of reducers to one, making it very inefficient for large datasets.
When a globally sorted result is not required and in many cases it isn’t, then you can use Hive’s
nonstandard extension, SORT BY instead.
SORT BY produces a sorted file per reducer.
If you want to control which reducer a particular row goes to, typically so you can perform
some subsequent aggregation, you can use Hive's DISTRIBUTE BY clause.
Example :
· To sort the weather dataset by year and temperature, in such a way as to ensure that all the
rows for a given year end up in the same reducer partition:
Hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC,
temperature DESC;

· Output :
1949 111
1949 78
1950 22
1950 0
1950 -11

HiveQL Data Manipulation


Hive DML (Data Manipulation Language) commands are used to insert, update, retrieve, and
delete data from the Hive table once the table and database schema has been defined using
Hive DDL commands.

The various Hive DML commands are:

1. LOAD
2. SELECT
3. INSERT
4. DELETE
5. UPDATE
6. EXPORT
7. IMPORT

1.LOAD - The LOAD statement in Hive is used to move data files into the locations
corresponding to Hive tables.
Syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)];

2.SELECT - The SELECT statement in Hive is similar to the SELECT statement in SQL used for
retrieving data from the database.

Syntax:
SELECT col1,col2 FROM tablename;
3.INSERT - The INSERT command in Hive loads the data into a Hive table. We can do
insert to both the Hive table or partition.
Syntax:
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;

4. DELETE - The DELETE statement in Hive deletes the table data. If the WHERE clause is
specified, then it deletes the rows that satisfy the condition in where clause.

The DELETE statement can only be used on the hive tables that support ACID.

Syntax:
DELETE FROM tablename [WHERE expression];

5.UPDATE - The UPDATE statement in Hive updates the table data. If the WHERE clause is
specified, then it updates the columns of the rows that satisfy the condition in the WHERE clause.
Like DELETE, the UPDATE statement can only be used on Hive tables that support ACID.
Syntax:
UPDATE tablename SET column = value [, column = value ...] [WHERE expression];
6.EXPORT - The Hive EXPORT statement exports the table or partition data along with the
metadata to the specified output location in the HDFS.
Metadata is exported in a _metadata file, and data is exported in a subdirectory ‘data.’
Syntax:
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])]
TO 'export_target_path' [ FOR replication('eventid') ];
7.IMPORT - The Hive IMPORT command imports the data from a specified location to a
new table or already existing table.

Syntax:
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION
(part_column="value"[, ...])]]
FROM 'source_path' [LOCATION 'import_target_path'];
Hive Query Language is a language used in Hive, similar to SQL, to process and analyze
unstructured data.

Hive Query Language is easy to use if you are familiar with SQL. The syntax of Hive QL is
very similar to SQL with slight differences.
Hive QL supports DDL, DML, and user-defined functions. Commands such as

 Inserting data into Hive tables from queries


 Inserting data into dynamic partitions
 Writing data into files from queries
 Enabling transactions in Hive
 Inserting values into tables from SQL
 Updating data
 Deleting data

HiveQL Queries

Hive provides an SQL-like query language for ETL purposes on top
of the Hadoop file system. The standard programming language used to create database
management tasks and processes is called Structured Query Language (SQL).
However, SQL is not the only language used to perform queries and data analysis in
general; AQL, Datalog, and DMX are other query languages.
Hive Query Language, or HiveQL, is a declarative language akin to SQL.
It also enables developers to process and analyze structured and semi-structured data by
substituting complicated MapReduce programs with Hive queries.
Any developer who is well acquainted with SQL commands will find it easy to create
requests using Hive Query Language.
Hive Query Language (HiveQL) provides an SQL-like environment in Hive to work with tables,
databases, and queries.

Different types of clauses can be used with Hive to perform different kinds of data
manipulation and querying. For better connectivity with applications outside the
environment, Hive provides JDBC connectivity as well, as the sketch below illustrates.
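
As a concrete illustration of that JDBC connectivity, here is a minimal sketch using the
HiveServer2 JDBC driver. The host/port, the credentials, and the employees table are
hypothetical; only the driver class name and the jdbc:hive2:// URL scheme come from Hive itself.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");      // HiveServer2 JDBC driver

        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // A GROUP BY query of the kind described in this section.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString("department") + " : " + rs.getLong("cnt"));
            }
        }
    }
}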

Hive queries provides the following features:

 Data modeling such as Creation of databases, tables, etc.


 ETL functionalities such as Extraction, Transformation, and Loading data into tables
 Joins to merge different data tables
 User specific custom scripts for ease of code
 Faster querying tool on top of Hadoop

There are four types of joins, and a deeper understanding of each one will help users pick the
right join to use and write the right queries. These four types of joins are:
 Inner join in Hive
 Left Outer Join in Hive
 Right Outer Join in Hive
 Full Outer Join in Hive
Examples of Hive Queries
Order By Query
The ORDER BY syntax in HiveQL uses the “SELECT” statement to help sort data. This
syntax goes through the columns on Hive tables to find and sift specific column values as
instructed in the “Order by” clause. The query will only pick the column name mentioned in
the Order by clause, and display the matching column values in ascending or descending
order.
Group By Query
When a Hive query comes with a “GROUP BY”, it explores the columns on Hive tables and
gathers all column values mentioned with the group by clause. The query will only look at
the columns under the name defined as “group by” clause, and it will show the results by
grouping the specific and matching column values.
Sort By Query
When a Hive query comes with a “Sort by” clause, it goes through the columns under the
name defined by the query. Once executed, the query explores columns of Hive tables to sort
the output. If you sort by queries with a “DESC” instruction, you sort and display the results
in descending order. Queries with an “ASC” will perform an ascending order of the sort and
show the results in a similar manner.
Cluster By Query
Hive queries with a CLUSTER BY clause or command are typically deployed in queries to
perform the functions of both DISTRIBUTE BY and SORT BY together. This particular
query ensures absolute sorting or ordering across all output data files.
Distribute By
The DISTRIBUTE BY instruction determines how the output is divided among reducers in a
MapReduce job. DISTRIBUTE BY functions similarly to a GROUP BY clause as it manages
how rows of data will be loaded into the reducer for processing.
