DBMS Unit-5

MIT School of Computing

Department of Computer Science & Engineering

Third Year Engineering

21BTCS502-Database Management System

Class - T.Y.PLD(Division-)

AY 2023-2024
SEM-I

Unit – V

Introduction to Database Management Systems


Syllabus

Motivations for NoSQL Databases, Types of NoSQL databases, Operations in NoSQL. Introduction to Big Data, handling large datasets using MapReduce and Hadoop. Introduction to the HBase data model and HBase regions. Introduction to emerging database technologies: Cloud Databases, Mobile Databases, SQLite Database, XML Databases. Introduction to Apache Spark, features and uses of Apache Spark. Embedded databases, recent embedded databases.

• Introduction to Big Data, handling large datasets using MapReduce and Hadoop, Parquet file format.
• Introduction to the HBase data model and HBase regions. Introduction to emerging database technologies: Cloud Databases, Mobile Databases.
• SQLite Database, XML Databases, introduction to Apache Spark, features and uses of Apache Spark.
Types of Data
1. Structured Data: has a fixed schema and format, e.g. RDBMS tables, Excel sheets, numbers.
2. Unstructured Data: has no fixed schema, e.g. documents, metadata, audio, video, images, and unstructured text such as the body of an e-mail message or a web page.
3. Semi-Structured Data: a form of structured data that does not conform to the formal structure of the data models associated with relational databases or other data tables, e.g. XML and JSON documents.
Big Data
• What is Big Data?
Big data is a term that describes large volumes of data, both structured and unstructured, that have the potential to be mined for information.
• Why do we need Big Data?
Big data dramatically increases both the number of data sources and the variety and volume of data that is useful for analysis. A non-relational system can be used to produce analytics from big data.
3 V's of Big Data: Volume, Velocity, and Variety.
Traditional BI vs. Big Data
• Traditional Business Intelligence (BI) systems provide various levels and kinds of analyses on structured data, but they are not designed to handle unstructured data.
• In traditional BI, data management is done with RDBMSs and data warehouses.
• Hadoop provides a solution for unstructured data.

Thinking at scale
• Need to process 100 TB datasets.
• On 1 node: scanning @ 50 MB/s takes about 23 days.
• On a 1000-node cluster: scanning @ 50 MB/s per node takes about 33 minutes.

We need an efficient, reliable and usable framework.
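A quick back-of-the-envelope check of these numbers (a minimal sketch in Python; the figures come straight from the bullets above):

    # Scan times for a 100 TB dataset at 50 MB/s per node
    dataset_bytes = 100 * 10**12
    rate_bytes_per_s = 50 * 10**6

    one_node_s = dataset_bytes / rate_bytes_per_s   # 2,000,000 s on a single node
    cluster_s = one_node_s / 1000                   # split evenly across 1000 nodes

    print(f"1 node:     {one_node_s / 86400:.1f} days")   # ~23.1 days
    print(f"1000 nodes: {cluster_s / 60:.1f} minutes")    # ~33.3 minutes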


Apache Hadoop
• Hadoop is an open-source framework that allows storing and processing big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• It implements a computational paradigm named MapReduce.

History of Hadoop
• Hadoop was created by Doug Cutting and Mike Cafarella.
• Doug, who was working at Yahoo! at the time, named it after his son's toy elephant.
• It was originally developed to support distribution for the Nutch search engine project.
• In April 2008, Hadoop sorted 1 TB of data in 209 seconds with the help of a 910-node cluster.
BENEFITS
• One of the main reasons organizations use Hadoop is its ability to store, manage and analyze vast amounts of structured and unstructured data quickly, reliably, flexibly and at low cost.
• Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
• Reliability – large computing clusters are prone to failure of individual nodes. Hadoop is fundamentally resilient: when a node fails, processing is redirected to the remaining nodes in the cluster, and data is automatically re-replicated in preparation for future node failures.
• Flexibility – unlike traditional relational database management systems, you don't have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply a schema to the data when it is read.
• Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.
Hadoop Architecture
Components of Core Hadoop
• Hadoop Distributed File System (HDFS): stores data on nodes in the cluster with the goal of providing greater bandwidth across the cluster.
• Hadoop MapReduce: a computational paradigm called Map/Reduce, which takes an application and divides it into multiple fragments of work, each of which can be executed on any node in the cluster.
Hadoop Ecosystem
The Hadoop ecosystem includes other tools to address particular needs:
• Hive: a data warehouse infrastructure that provides data summarization.
• HBase: a scalable, distributed database that supports structured data storage for large tables.
• Pig: a high-level data-flow language and execution framework for parallel computation and ad hoc querying.
• ZooKeeper: a high-performance coordination service for distributed applications.
Hadoop Ecosystem
• Spark: Spark is both a programming model and a computing model. It provides a gateway to in-memory computing for Hadoop. Common use cases for Apache Spark include real-time queries, event stream processing, iterative algorithms, complex operations and machine learning (see the small sketch after this list).
• Oozie: the workflow scheduler that was developed as part of the Apache Hadoop project. It manages how workflows start and execute, and also controls the execution path.
• Sqoop: think of Sqoop as a front-end loader for big data. Sqoop is a command-line interface that facilitates moving bulk data between Hadoop and relational databases and other structured data stores.
• Mahout: a scalable machine learning library that implements a variety of machine learning approaches.
• Ambari: created to help manage Hadoop. It offers support for many of the tools in the Hadoop ecosystem, including Hive, HBase, Pig, Sqoop and ZooKeeper. The tool features a management dashboard that keeps track of cluster health and can help diagnose performance issues.
• Apache Kafka: a distributed streaming platform. It lets you publish and subscribe to streams of records, build real-time streaming data pipelines that reliably move data between systems or applications, and build real-time streaming applications that transform or react to streams of data.
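As a small illustration of the Spark bullet above, a minimal PySpark word count (a sketch, not part of the original slides; it assumes the pyspark package is installed, and the input path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # RDDs are held in memory between steps -- the in-memory computing
    # model that distinguishes Spark from disk-based MapReduce
    lines = spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
    counts = (lines.flatMap(lambda line: line.split())   # line -> words
                   .map(lambda word: (word, 1))          # word -> (word, 1)
                   .reduceByKey(lambda a, b: a + b))     # sum counts per word

    print(counts.collect())
    spark.stop()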
HDFS + MapReduce
Hadoop Distributed File System
• HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers.
• HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Assumptions and Goals
• Hardware Failure – detection of faults and quick, automatic recovery.
• Streaming Data Access – high throughput of data access.
• Large Data Sets – HDFS is tuned to support large files.
• Simple Coherency Model – write-once-read-many access model.
• Moving Computation is Cheaper than Moving Data.
• Portability Across Heterogeneous Hardware and Software Platforms – easily portable from one platform to another.

NameNode and DataNodes
NameNode
The system hosting the NameNode acts as the master server, and it performs the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.

DataNodes
• These nodes manage the data storage of their system.
• DataNodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.
Block
• The file in a file system is divided into one or more segments, which are stored in individual DataNodes.
• The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
• Advantages of assigning storage to a file through the concept of blocks:
  • Faster calculation of storage assignment.
  • A file can be larger than any single disk in the network.
  • Operates at (approximately) the disk transfer rate.
  • No wastage of space.
• E.g.: a 420 MB file will be stored as follows:
HDFS – Data Storage Pattern
Figure: the client connects to the NameNode, then writes the blocks B1, B2 and B3 of SampleFile.avi to the DataNodes; each block is replicated across multiple DataNodes, and an acknowledgement is returned to the client.
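Continuing the 420 MB example, a minimal sketch of how the block split works out with the 64 MB default mentioned above (the helper name is illustrative; the figure shows three blocks for simplicity, but at 64 MB a 420 MB file actually occupies seven):

    import math

    def split_into_blocks(file_size_mb, block_size_mb=64):
        """Return the block sizes (in MB) a file would occupy in HDFS."""
        n_blocks = math.ceil(file_size_mb / block_size_mb)
        last = file_size_mb - block_size_mb * (n_blocks - 1)   # partial last block
        return [block_size_mb] * (n_blocks - 1) + [last]

    # Six full 64 MB blocks plus one 36 MB block; only the bytes actually
    # written consume space -- no wastage.
    print(split_into_blocks(420))   # [64, 64, 64, 64, 64, 64, 36]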
HDFS – Data Read Pattern
Figure: the client connects to the NameNode to locate the blocks B1, B2 and B3 of SampleFile.avi, then reads each block directly from a DataNode holding a replica; the read completes once all blocks are retrieved.
MapReduce
• MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
• A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks.
Map Phase
• Records from the data source are fed into the map function as key/value pairs.
• map() produces one or more intermediate values along with an output key from the input.
• One map task is created for each InputSplit generated by the InputFormat for the job.
• The framework then calls map(WritableComparable, Writable, OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
Reduce Phase
• After the map phase is over, all the intermediate values for a given output key are combined together into a list.
• reduce() combines those intermediate values into one or more final values for that same output key.
• The number of reduces for the job is set by the user via JobConf.setNumReduceTasks(int).
• The Reducer has 3 primary phases: shuffle, sort and reduce.
  I. Shuffle – the framework fetches the relevant partition of the output of all the mappers.
  II. Sort – the framework groups Reducer inputs by keys.
  III. Reduce – in this phase the reduce(WritableComparable, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.
MapReduce with Multiple Reduce Tasks (figure)
JobTracker
• The JobTracker works above HDFS; there is one JobTracker, to which client applications submit MapReduce jobs.
• The JobTracker pushes work out to available TaskTracker nodes in the cluster.
• It strives to keep the work as close to the data as possible.
• Because the file system is rack-aware, the JobTracker knows which node contains the data, and which other machines are nearby.
• The JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
TaskTracker
• A TaskTracker runs on a DataNode – mostly, on all DataNodes.
• Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
• The TaskTracker is in constant communication with the JobTracker, signalling the progress of the task in execution.
• TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker will assign the task executed by that TaskTracker to another node.
MapReduce Data Flow Example: Word Count

Input (two splits):
  "Hi, how are you? I am good"
  "Hello Hello how are you? Not so good"

Map (one task per split, emitting (word, 1); for brevity only some words are shown):
  Split 1 → Hi 1, how 1, Are 1, you 1
  Split 2 → Hello 1, Hello 1, how 1, Are 1, you 1

Shuffle and sort (intermediate results grouped and sorted by key):
  Are [1 1], Hello [1 1], Hi [1], how [1 1], you [1 1]

Reduce (sum each list) → merged output:
  Are 2, Hello 2, Hi 1, how 2, you 2
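A minimal single-process Python simulation of this data flow (an illustration of the map/shuffle/reduce steps, not the Hadoop Java API; unlike the figure it counts every word, and it lower-cases them):

    from collections import defaultdict

    def map_phase(line):
        """Map: break one record into intermediate (key, value) pairs -- here (word, 1)."""
        return [(word.strip(",?!.").lower(), 1) for word in line.split()]

    def reduce_phase(word, counts):
        """Reduce: combine all intermediate values for one key into a final value."""
        return word, sum(counts)

    splits = ["Hi, how are you? I am good",
              "Hello Hello how are you? Not so good"]

    # Map: each input split is processed independently (in parallel on a real cluster)
    intermediate = [pair for line in splits for pair in map_phase(line)]

    # Shuffle & sort: the framework groups the intermediate values by key
    groups = defaultdict(list)
    for word, one in intermediate:
        groups[word].append(one)

    # Reduce: called once per <key, (list of values)> pair, merged into sorted output
    for word in sorted(groups):
        print(*reduce_phase(word, groups[word]))   # e.g. "are 2", "hello 2", "hi 1"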
Hadoop 1.0 vs. Hadoop 2.0
• In Hadoop 1.0, only MapReduce framework jobs can be run to process the data stored in HDFS.
• Hadoop 2.0 came up with the new framework YARN (Yet Another Resource Negotiator), which provides the ability to run non-MapReduce applications.
HBase – An Apache Hadoop Project
Introduction
• HBase is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.
• Apache HBase began as a project by the company Powerset, out of a need to process massive amounts of data for the purposes of natural language search.
Why use HBase?
• Storing large amounts of data.
• Storing unstructured or variable-column data.
• Big data with random reads and writes.
HBase vs. HDFS
• Both are distributed systems that scale to hundreds or thousands of nodes.
• HDFS is good for batch processing (scans over big files), but:
  • not good for record lookup,
  • not good for incremental addition of small batches,
  • not good for updates.
WHAT IS HBASE?
• HBase is a Java implementation of Google's BigTable.
• Google defines BigTable as a "sparse, distributed, persistent multidimensional sorted map."
• Committers and contributors come from diverse organizations such as Facebook, Cloudera, StumbleUpon, Trend Micro, Intel, Hortonworks and Continuuity.
Sparse
• Sparse means that fields in rows can be empty or NULL without bringing HBase to a screeching halt.
• HBase can handle the fact that we don't (yet) know that information.
• Sparse data is supported with no waste of costly storage space.
Multidimensional Sorted Map
• A map (also known as an associative array) is an abstract collection of key-value pairs, where the key is unique.
• The keys are stored in HBase and sorted in byte-lexicographical order.
• Each value can have multiple versions, which makes the data model multidimensional. By default, data versions are implemented with a timestamp.
HBase Data Model
• HBase data stores consist of one or more tables, which are indexed by row keys.
• Data is stored in rows with columns, and rows can have multiple versions. By default, data versioning for rows is implemented with time stamps.
• Columns are grouped into column families, which must be defined up front during table creation.
• Column families are grouped together on disk, so grouping data with similar access patterns reduces overall disk access and increases performance.
Data in Tabular Form

Key | Name: First | Name: Last | Home: Phone | Home: Email         | Office: Phone | Office: Email
101 | Florian     | Krepsbach | 555-1212    | florian@wobegon.org | 666-1212      | fk@phc.com
102 | Marilyn     | Tollerud  | 555-1213    |                     | 666-1213      |
103 | Pastor      | Inqvist   | 555-1214    | inqvist@wels.org    |               |
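One way to picture this table in BigTable's terms (a conceptual Python sketch of the "sparse, multidimensional sorted map", not HBase's actual on-disk format; the helper name get_latest is illustrative):

    # {row_key: {column_family: {qualifier: {timestamp: value}}}}
    table = {
        "101": {
            "Name":   {"First": {1: "Florian"}, "Last": {1: "Krepsbach"}},
            "Home":   {"Phone": {1: "555-1212"}, "Email": {1: "florian@wobegon.org"}},
            "Office": {"Phone": {1: "666-1212"}, "Email": {1: "fk@phc.com"}},
        },
        "102": {
            "Name":   {"First": {1: "Marilyn"}, "Last": {1: "Tollerud"}},
            "Home":   {"Phone": {1: "555-1213"}},   # sparse: no Email cell stored
            "Office": {"Phone": {1: "666-1213"}},
        },
    }

    def get_latest(row, family, qualifier):
        """Return the newest version of a cell, like a default HBase read."""
        versions = table[row][family][qualifier]
        return versions[max(versions)]     # versions are keyed by timestamp

    for row_key in sorted(table):          # row keys are kept in sorted order
        print(row_key, get_latest(row_key, "Name", "First"))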
HBase Data Model
• Column qualifiers are specific names assigned to our data values.
• Unlike column families, column qualifiers can be virtually unlimited in content, length and number.
• Because the number of column qualifiers is variable, new data can be added to column families on the fly, making HBase flexible and highly scalable.
HBase Data Model
• HBase stores the column qualifier with our value, and since HBase doesn't limit the number of column qualifiers we can have, creating long column qualifiers can be quite costly in terms of storage.
• Values stored in HBase are time stamped by default, which means we have a way to identify different versions of our data right out of the box.
• The versioned data is stored in decreasing order, so that the most recent value is returned by default unless a query specifies a particular timestamp.
HBase Architecture: Region Servers
• RegionServers are the software processes (often called daemons) we activate to store and retrieve data in HBase. In production environments, each RegionServer is deployed on its own dedicated compute node.
• When a table grows beyond a configurable limit, the HBase system automatically splits the table and distributes the load to another RegionServer. This is called auto-sharding.
• As tables are split, the splits become regions. Regions store a range of key-value pairs, and each RegionServer manages a configurable number of regions.
HBase Architecture (figure)
HBase Architecture: Region Servers
• Each column family store object has a read cache called the BlockCache and a write cache called the MemStore.
• The BlockCache helps with random read performance.
• The Write Ahead Log (WAL, for short) ensures that our HBase writes are reliable.
• The design of HBase is to flush column family data stored in the MemStore to one HFile per flush. Then, at configurable intervals, HFiles are combined into larger HFiles.
HBase Architecture: Compactions
• Minor compactions combine a configurable number of smaller HFiles into one larger HFile.
• Minor compactions are important because without them, reading a particular row can require many disk reads and cause slow overall performance.
• A major compaction seeks to combine all HFiles into one large HFile. In addition, a major compaction does the cleanup work after a user deletes a record.
HBase Architecture: Master Server
Responsibilities of a Master Server:
• Monitor the region servers in the HBase cluster.
• Handle metadata operations.
• Assign regions.
• Manage region server failover.
HBase Architecture: ZooKeeper
• HBase clusters can be huge, and coordinating the operations of the MasterServers, RegionServers, and clients can be a daunting task, but that's where ZooKeeper enters the picture.
• ZooKeeper is a distributed cluster of servers that collectively provides reliable coordination and synchronization services for clustered applications.
HBase Architecture: CAP Theorem
• HBase provides a high degree of reliability: it can tolerate node failures and still function properly.
• In CAP terms, HBase provides "Consistency" and "Partition Tolerance," but is not always "Available."
Accessing HBase
• Java API
• REST/HTTP
• Apache Thrift
• Hive/Pig for analytics
HBase API
Types of access:
• Gets: gets a row's data based on the row key.
• Puts: inserts a row with data based on the row key.
• Scans: finds all matching rows based on the row key. Scan logic can be extended by using filters.
Example: Gets and Puts
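The original slides showed Java-API screenshots here; as a stand-in, a minimal sketch using the third-party happybase Python client over the Apache Thrift gateway listed above (an assumption, not the slides' code; the table and column names are illustrative, and a Thrift server is assumed on localhost):

    import happybase

    connection = happybase.Connection("localhost")   # Thrift gateway, default port 9090
    table = connection.table("contacts")

    # Put: insert a row keyed by row key; cells are addressed as family:qualifier
    table.put(b"101", {b"Name:First": b"Florian",
                       b"Home:Phone": b"555-1212"})

    # Get: fetch one row's data by row key
    print(table.row(b"101"))

    # Scan: iterate over all rows whose key starts with "10"
    for row_key, data in table.scan(row_prefix=b"10"):
        print(row_key, data)

    connection.close()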
HBase vs. RDBMS / When to Use HBase / Powered by HBase (comparison tables and example deployments were shown as figures)
Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
Mobile Databases
• In mobile computing, the problems are more difficult, mainly because of:
  • The limited and intermittent connectivity afforded by wireless communications.
  • The limited life of the power supply (battery).
  • The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
• The general architecture of a mobile platform is illustrated in Fig 30.1.
Characteristics of Mobile Environments
• The characteristics of mobile computing include:
  • Communication latency
  • Intermittent connectivity
  • Limited battery life
  • Changing client location
Characteristics of Mobile Environments
• Client mobility also poses many data management challenges:
  • Servers must keep track of client locations in order to efficiently route messages to them.
  • Client data should be stored in the network location that minimizes the traffic necessary to access it.
  • The act of moving between cells must be transparent to the client.
  • The server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
• Client mobility also allows new applications that are location-based.
Data Management Issues
• From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication.
   • A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components.
   • Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data Management Issues
• Data management issues as they apply to mobile databases:
  • Data distribution and replication
  • Transaction models
  • Query processing
  • Recovery and fault tolerance
  • Mobile database design
  • Location-based services
  • Division of labor
  • Security
Application: Intermittently Synchronized Databases
• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
SQLite
• SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
• SQLite is the most widely deployed SQL database engine in the world.
• The source code for SQLite is in the public domain.
Why SQLite?
• SQLite does not require a separate server process or system to operate (serverless).
• SQLite comes with zero configuration, which means no setup or administration is needed.
• A complete SQLite database is stored in a single cross-platform disk file.
• SQLite is very small and lightweight: less than 400 KiB fully configured, or less than 250 KiB with optional features omitted.
• SQLite is self-contained, which means no external dependencies.
• SQLite transactions are fully ACID-compliant, allowing safe access from multiple processes or threads.
• SQLite supports most of the query language features found in the SQL92 (SQL2) standard.
• SQLite is written in ANSI C and provides a simple and easy-to-use API.
• SQLite is available on UNIX (Linux, Mac OS X, Android, iOS) and Windows (Win32, WinCE, WinRT).
SQLite Limitations
• A few SQL92 features are unsupported or only partially supported in SQLite:
  • RIGHT OUTER JOIN and FULL OUTER JOIN (LEFT OUTER JOIN is supported)
  • Complete ALTER TABLE support (only the RENAME TABLE and ADD COLUMN variants)
  • Complete trigger support (FOR EACH ROW triggers are supported; FOR EACH STATEMENT triggers are not)
  • Writing to VIEWs (views are read-only)
  • GRANT and REVOKE
SQLite Commands
• DDL – Data Definition Language
  • CREATE
  • ALTER
  • DROP
• DML – Data Manipulation Language
  • INSERT
  • UPDATE
  • DELETE
• DQL – Data Query Language
  • SELECT
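A minimal sketch exercising these command groups with Python's built-in sqlite3 module (the table and column names are illustrative; an in-memory database is used, so no file or server is needed):

    import sqlite3

    # Serverless and zero-configuration: one call opens a database
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # DDL
    cur.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

    # DML (parameterized statements inside an ACID transaction)
    cur.execute("INSERT INTO contacts (name, phone) VALUES (?, ?)", ("Florian", "555-1212"))
    cur.execute("UPDATE contacts SET phone = ? WHERE name = ?", ("555-9999", "Florian"))
    conn.commit()

    # DQL
    for row in cur.execute("SELECT id, name, phone FROM contacts"):
        print(row)

    conn.close()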
Cloud Database
• A cloud database is a database that typically runs on a cloud computing platform; access to it is provided as a service.
• Two cloud database environment models exist: traditional and database as a service (DBaaS).
• In a traditional cloud model, a database runs on an IT department's infrastructure via a virtual machine. Tasks of database oversight and management fall upon the IT staffers of the organization.
• By comparison, the DBaaS model is a fee-based subscription service in which the database runs on the service provider's physical infrastructure. Different service levels are usually available.
Cloud Database Benefits
• Elimination of physical infrastructure: in a cloud database environment, the cloud computing provider of servers, storage and other infrastructure is responsible for maintenance and availability.
• Cost savings.
• Instantaneous scalability.
• Performance guarantees.
• Specialized expertise.
• Latest technology.
• Failover support.
• Declining pricing.
XML Databases
• An XML database is used to store huge amounts of information in the XML format. As the use of XML is increasing in every field, it is necessary to have a secure place to store XML documents.
• The data stored in the database can be queried using XQuery, serialized, and exported into a desired format.
• There are two major types of XML databases:
  • XML-enabled
  • Native XML (NXD)
XML-Enabled Database
• An XML-enabled database is nothing but an extension provided for the conversion of XML documents.
• It is a relational database, where data is stored in tables consisting of rows and columns.
• The tables contain sets of records, which in turn consist of fields.
Native XML Database
• A native XML database is based on containers rather than a table format. It can store large amounts of XML documents and data.
• A native XML database is queried by XPath expressions.
• Native XML databases have an advantage over XML-enabled databases: they are more capable of storing, querying and maintaining XML documents.
Example
• The following example demonstrates an XML document as stored in an XML database:

<?xml version = "1.0"?>
<contact-info>
   <contact1>
      <name>ABC</name>
      <company>PQR</company>
      <phone>(011) 123-4567</phone>
   </contact1>
   <contact2>
      <name>XYZ</name>
      <company>PQR</company>
      <phone>(011) 789-4567</phone>
   </contact2>
</contact-info>
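A minimal sketch of XPath-style querying over the document above, using Python's built-in xml.etree.ElementTree (an illustration of XPath-style access, not a native XML database engine):

    import xml.etree.ElementTree as ET

    doc = """<contact-info>
      <contact1><name>ABC</name><company>PQR</company><phone>(011) 123-4567</phone></contact1>
      <contact2><name>XYZ</name><company>PQR</company><phone>(011) 789-4567</phone></contact2>
    </contact-info>"""

    root = ET.fromstring(doc)

    # XPath-style query: every <phone> element anywhere under the root
    for phone in root.findall(".//phone"):
        print(phone.text)

    # XPath-style query with a path: the <name> inside <contact1>
    print(root.find("contact1/name").text)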
