DBMS Unit-5
Class - T.Y.PLD(Division-)
AY 2023-2024
SEM-I
Unit – V
MIT School of Computing
Department of Computer Science & Engineering
Syllabus
• Introduction to Big Data; handling large datasets using MapReduce and Hadoop; the Parquet file format.
• Introduction to the HBase data model and HBase regions. Introduction to emerging database technologies: cloud databases, mobile databases.
• SQLite databases, XML databases. Introduction to Apache Spark; features and uses of Apache Spark.
Types of Data
1. Structured data: has a fixed schema and format, e.g. RDBMS tables, Excel sheets, numbers.
2. Unstructured data: has no fixed schema, e.g. documents, metadata, audio, video, images, and unstructured text such as the body of an e-mail message or a web page.
3. Semi-structured data: a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other data tables, e.g. XML and JSON documents.
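To make the distinction concrete, here is a small Python sketch (the records and field names are invented for illustration): each JSON record is self-describing, so records in the same collection may carry different fields, unlike rows in a relational table.

```python
import json

# Semi-structured data: each JSON record carries its own structure,
# so two records in the same collection may have different fields.
records = [
    '{"name": "Asha", "email": "asha@example.com"}',
    '{"name": "Ravi", "phone": "555-0101", "tags": ["vip"]}',
]

parsed = [json.loads(r) for r in records]
# Unlike a relational row, the set of keys differs per record.
keys = [sorted(p.keys()) for p in parsed]
print(keys)  # [['email', 'name'], ['name', 'phone', 'tags']]
```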
Big Data
• What is Big Data?
Big data is a term that describes large volumes of data, both structured and unstructured, that have the potential to be mined for information.
• Why do we need Big Data?
Big data dramatically increases both the number of data sources and the variety and volume of data that is useful for analysis. A non-relational system can be used to produce analytics from big data.
3 V’s of Big Data: Volume, Velocity, and Variety
Traditional BI vs. Big Data
• Traditional Business Intelligence (BI) systems provide
various levels and kinds of analyses on structured data
but they are not designed to handle unstructured data.
HDFS – Data Storage Pattern
[Diagram: a client connects to HDFS and writes SampleFile.avi as blocks B1, B2, and B3; each block is replicated across several DataNodes, and each DataNode acknowledges the blocks it stores.]
HDFS – Data Read Pattern
[Diagram: a client connects to HDFS, locates the DataNodes holding replicas of blocks B1, B2, and B3 of SampleFile.avi, reads the blocks, and the read completes once all blocks have been retrieved.]
MapReduce
• MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• A MapReduce job usually splits the input data-set into
independent chunks which are processed by the map
tasks in a completely parallel manner. The framework
sorts the outputs of the maps, which are then input to
the reduce tasks.
Map Phase
• Records from the data source are fed into the map function as key/value pairs.
• map() produces one or more intermediate values along with an output key from the input.
• One map task is created for each InputSplit generated by the InputFormat for the job.
• The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
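As an illustration only (plain Python rather than the Hadoop Java API named above), the map step of a word count might look like this; the helper name and sample line are invented:

```python
def map_words(line):
    """Emit an intermediate (key, value) pair for each word in the line."""
    return [(word.lower().strip("?,.!"), 1) for word in line.split()]

pairs = map_words("Hi, how are you?")
print(pairs)  # [('hi', 1), ('how', 1), ('are', 1), ('you', 1)]
```

Each call to map() is independent, which is what lets the framework run one map task per InputSplit in parallel.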
Reduce Phase
• After the map phase is over, all the intermediate values for a given output key are combined together into a list.
• reduce() combines those intermediate values into one or more final values for that same output key.
• The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).
• The Reducer has 3 primary phases: shuffle, sort, and reduce.
I. Shuffle - in this phase the framework fetches the relevant partition of the output of all the mappers.
II. Sort - the framework groups Reducer inputs by keys.
III. Reduce - in this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
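The shuffle, sort, and reduce phases above can be sketched in plain Python (an illustration, not the Hadoop API; the function names and sample pairs are invented):

```python
from collections import defaultdict

def shuffle_and_sort(pairs):
    """Group intermediate (key, value) pairs by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting the keys mirrors the framework's sort phase.
    return dict(sorted(groups.items()))

def reduce_counts(key, values):
    """Combine the intermediate values for one key into a final count."""
    return key, sum(values)

pairs = [("hello", 1), ("hi", 1), ("hello", 1)]
grouped = shuffle_and_sort(pairs)       # {'hello': [1, 1], 'hi': [1]}
results = [reduce_counts(k, v) for k, v in grouped.items()]
print(results)  # [('hello', 2), ('hi', 1)]
```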
MapReduce with Multiple Reduce Tasks
JobTracker
• Works above HDFS; there is one JobTracker, to which client applications submit MapReduce jobs.
• The JobTracker pushes work out to available TaskTracker nodes in the cluster.
• It strives to keep the work as close to the data as possible: because the file system is rack-aware, the JobTracker knows which node contains the data and which other machines are nearby.
• The JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
TaskTracker
• A TaskTracker runs on each DataNode.
• Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
• The TaskTracker is in constant communication with the JobTracker, signalling the progress of the task in execution.
• TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker assigns the tasks executed by that TaskTracker to another node.
Map Reduce Data Flow Example: Word Count
Input:
  Line 1: "Hi, how are you? I am good"
  Line 2: "Hello Hello how are you? Not so good"
Map (intermediate key/value pairs):
  Line 1 → hi 1, how 1, are 1, you 1
  Line 2 → Hello 1, Hello 1, how 1, are 1, you 1
Shuffle/Sort (grouped by key):
  Are [1, 1], Hello [1, 1], Hi [1], how [1, 1], you [1, 1]
Reduce (merged, sorted output):
  Are 2, Hello 2, Hi 1, how 2, you 2
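This word-count data flow can be simulated end to end in plain Python (an illustrative sketch, not Hadoop itself; the input lines are lower-cased, punctuation-free versions of those in the example):

```python
from collections import defaultdict

def word_count(lines):
    """End-to-end simulation of the word-count flow: map, shuffle/sort, reduce."""
    # Map: emit an intermediate (word, 1) pair for every word in every line.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle/sort: group the intermediate values by key.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    # Reduce: sum the grouped values to get the final count per word.
    return {word: sum(ones) for word, ones in sorted(groups.items())}

counts = word_count(["hi how are you", "hello hello how are you"])
print(counts)  # {'are': 2, 'hello': 2, 'hi': 1, 'how': 2, 'you': 2}
```

In a real cluster the map calls run on different nodes and the grouping is done by the framework's shuffle; the logic per key is the same.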
Hadoop 1.0 Vs Hadoop2.0
• In Hadoop 1.0, only MapReduce framework jobs could be run to process the data stored in HDFS.
• Hadoop 2.0 came up with a new framework, YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
HBase – An Apache Hadoop Project
Introduction
• HBase is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (the Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.
• Apache HBase began as a project by the company Powerset out of a need to
process massive amounts of data for the purposes of natural language search.
Why use HBase?
• Storing large amounts of data.
• The Write-Ahead Log (WAL, for short) ensures that HBase writes are reliable.
• Automatic assignment of regions across the cluster.
• Access via REST/HTTP.
• Access via Apache Thrift.
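The HBase data model can be pictured as a sorted, multi-level map. Below is a toy Python sketch of that shape (not the real HBase client API; the table contents, row keys, and column names are invented for illustration): a table maps row key → column family → column qualifier → value.

```python
# Toy model of the HBase/BigTable data model:
# row key -> column family -> column qualifier -> value.
table = {}

def put(row, family, qualifier, value):
    """Store a cell, creating the row and family maps on demand."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    """Fetch one cell, or None if any level of the map is missing."""
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#42", "info", "name", "Asha")
put("user#42", "info", "email", "asha@example.com")
put("user#99", "info", "name", "Ravi")

print(get("user#42", "info", "name"))   # Asha
print(get("user#99", "info", "email"))  # None (cell was never written)
```

Real HBase adds what this sketch omits: rows are kept sorted by key and split into regions, and each cell is versioned by timestamp.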