
Program: B.E.
Subject Name: Data Science
Subject Code: IT-8003
Semester: 8th

Unit IV: Data Science Tools

Cluster Architecture vs Traditional Architecture:


Building a highly reliable, high-performance cluster from commodity servers takes massive
engineering effort. The data as well as the services running on the cluster must be made highly
available and fully protected from node failures without affecting the overall performance of
the cluster. Achieving such speed and reliability on commodity servers is even more
challenging because of the lack of non-volatile RAM or any specialized connectivity between
nodes with which to deploy redundant data paths or RAID configurations.

Comparison to Other Platforms


There are many approaches that attempt to serve as an underlying platform, but invariably
these platforms hit scale, speed, and reliability issues. With Hadoop, for instance, the
underlying Hadoop Distributed File System (HDFS) has many limitations.
The limited semantics of HDFS also mean that real-time operation is not possible. To make
data visible in HDFS, you have to close the file immediately after writing, so you are forced
to write a small amount of data, close the file, and repeat the process. You end up creating
too many files, which is a serious, documented problem with HDFS because of its centralized
metadata storage architecture.

Furthermore, HDFS cannot truly support read-write access via NFS because the NFS protocol cannot
invoke a close operation on the file when writing to HDFS. What this limitation means is
that HDFS has to take a guess as to when to close the file. If it guesses wrong, you will lose
data.
Traditional solutions
The traditional approach to side-stepping this fast re-sync problem is to use a dual-ported
disk array that runs RAID-6 with idle spares. The dual-ported disk array connects the two servers,
one to each of the ports. The servers use NVRAM, which is non-volatile RAM, to manage the
disks. The primary copies the NVRAM over to the replica continuously. When a primary or
replica fails, the other one takes over and there is nothing to re-sync, because it has
everything it needs for the drives to work. This scenario can work, but it is not scalable
because you now have to enter into large purchase contracts with a multi-year spare-parts
plan.


Figure 4.1: Cluster vs Traditional Architecture

HADOOP:
Hadoop is an open source distributed processing framework that manages data processing
and storage for big data applications running in clustered systems. It is at the centre of a
growing ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine learning
applications. Hadoop can handle various forms of structured and unstructured data, giving
users more flexibility for collecting, processing and analyzing data than relational databases
and data warehouses provide.

Formally known as Apache Hadoop, the technology is developed as part of an open source
project within the Apache Software Foundation (ASF). Commercial distributions of Hadoop
are currently offered by four primary vendors of big data platforms: Amazon Web Services
(AWS), Cloudera, Hortonworks and MapR Technologies. In addition, Google, Microsoft and
other vendors offer cloud-based managed services that are built on top of Hadoop and related
technologies.

HADOOP vs Distributed Database:

Hadoop is not a database; it is basically a distributed file system used to process and
store large data sets across a computer cluster. It has two main core components: HDFS
(Hadoop Distributed File System) and MapReduce.

On the other hand, a distributed database is a database used to store data in the form
of tables comprising several rows and columns.

Difference between Hadoop and Distributed Database:


Unlike Hadoop, a distributed database cannot be used to process and store a large
amount of data, or simply big data. Following are some differences between Hadoop and
a distributed database:

 Data Volume: Data volume means the quantity of data that is being stored and
processed. A distributed database works better when the volume of data is low (in
gigabytes); when the data size is huge, i.e., in terabytes and petabytes, a distributed
database fails to give the desired results.
On the other hand, Hadoop works better when the data size is big. It can process
and store large amounts of data quite effectively compared to a distributed database.
 Architecture: If we talk about the architecture, Hadoop has the core components HDFS
(Hadoop Distributed File System), Hadoop MapReduce (a programming model to process
large data sets) and Hadoop YARN (used to manage computing resources in computer
clusters).
A distributed database possesses ACID properties, which are responsible for maintaining and
ensuring data integrity and accuracy when a transaction takes place in a database.
 Throughput: Throughput means the total volume of data processed in a particular period
of time so that the output is maximum. A distributed database fails to achieve as high a
throughput as the Apache Hadoop framework.
 Data Variety: Data variety generally means the type of data to be processed. It may be
structured, semi-structured or unstructured. Hadoop has the ability to process and
store all varieties of data, whether structured, semi-structured or unstructured,
although it is mostly used to process large amounts of unstructured data.
A distributed database can manage only structured and semi-structured data; it
cannot be used to manage unstructured data.
 Latency/Response Time: Hadoop has higher throughput: you can access batches
of large data sets more quickly than with a distributed database, but you cannot access a
particular record from the data set very quickly. Thus Hadoop is said to have high latency.
The distributed database is comparatively faster at retrieving information from
the data set. It takes very little time to perform the same function, provided that there
is a small amount of data.
 Scalability: A distributed database provides vertical scalability, which is also known as
"Scaling Up" a machine. It means you can add more resources or hardware, such as
memory and CPU, to a machine in the computer cluster.
Hadoop, on the other hand, provides horizontal scalability, which is also known as "Scaling Out" a
machine. It means adding more machines to the existing computer cluster, as a result of
which Hadoop becomes fault tolerant: there is no single point of failure. Due to the
presence of more machines in the cluster, you can easily recover data irrespective of the
failure of one of the machines.
 Data Processing: Apache Hadoop supports OLAP (Online Analytical Processing), which is
used in data mining techniques. OLAP involves very complex queries and aggregations,
and the data processing speed depends on the amount of data, which can take several hours.
The database design is de-normalized, having fewer tables; OLAP uses star schemas.
On the other hand, a distributed database supports OLTP (Online Transaction Processing),
which involves comparatively fast query processing. The database design is highly
normalized, having a large number of tables. OLTP generally uses a 3NF (entity model)
schema.
 Cost: Hadoop is a free and open source software framework; you do not have to pay to
buy a software license.
A distributed database, on the other hand, is licensed software; you have to pay to buy the
complete software license.

Building blocks of HADOOP

The building blocks of Hadoop are HDFS, MapReduce and YARN.

HDFS: The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications. It employs a NameNode and DataNode architecture to implement a
distributed file system that provides high-performance access to data across highly scalable
Hadoop clusters.

HDFS is a key part of many Hadoop ecosystem technologies, as it provides a reliable means
for managing pools of big data and supporting related big data analytics applications.

HDFS supports the rapid transfer of data between compute nodes. At its outset, it was closely
coupled with MapReduce, a programmatic framework for data processing. When HDFS takes
in data, it breaks the information down into separate blocks and distributes them to different
nodes in a cluster, thus enabling highly efficient parallel processing.
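
The HDFS Java API reflects this design: a client simply writes to and reads from a path, while block splitting and replication to DataNodes happen behind the scenes. The sketch below is a minimal illustration and not code from the notes; the NameNode URI and the file path are assumptions.

```java
// Minimal sketch of writing and reading a file through the HDFS Java API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client streams bytes; HDFS splits them into blocks
            // and replicates the blocks across DataNodes transparently.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the same API.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```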

MapReduce: MapReduce is a framework with which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable
manner.

MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key/value pairs). The reduce task then takes the output
from a map as input and combines those data tuples into a smaller set of tuples. As the
name MapReduce implies, the reduce task is always performed after the map job.
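
A word-count job is the standard illustration of these two tasks. The sketch below is a minimal version of that well-known example, not code from the notes: the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

```java
// Minimal word-count sketch showing the Map and Reduce tasks described above.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted by the map tasks for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```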

YARN: Apache Hadoop YARN is the resource management and job scheduling technology in
the open source Hadoop distributed processing framework. One of Apache Hadoop's core
components, YARN is responsible for allocating system resources to the various applications
running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

YARN stands for Yet Another Resource Negotiator, but it's commonly referred to by the
acronym alone; the full name was self-deprecating humor on the part of its developers. The
technology became an Apache Hadoop subproject within the Apache Software Foundation
(ASF) in 2012 and was one of the key features added in Hadoop 2.0, which was released for
testing that year and became generally available in October 2013.


HADOOP Data Types:

Hadoop provides classes that wrap the Java primitive types and implement the
WritableComparable and Writable Interfaces. They are provided in the org.apache.hadoop.io
package.

All the Writable wrapper classes have a get() and a set() method for retrieving and storing the
wrapped value.
Primitive Writable Classes: These are Writable Wrappers for Java primitive data types and
they hold a single primitive value that can be set either at construction or via a setter method.
All these primitive writable wrappers have get() and set() methods to read or write the
wrapped value. Below are primitive writable data types available in Hadoop.

BooleanWritable, ByteWritable, IntWritable, VIntWritable, FloatWritable, LongWritable,
VLongWritable, DoubleWritable

VIntWritable and VLongWritable are used for variable-length integer and variable-length
long types respectively.
The serialized sizes of the fixed-length primitive writable data types are the same as the
sizes of the corresponding Java data types. So, the size of IntWritable is 4 bytes and
LongWritable is 8 bytes.
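
As a small illustration (not taken from the notes), the sketch below shows a value supplied at construction or stored later through set(), and read back with get().

```java
// Short sketch of the get()/set() methods on the primitive Writable wrappers.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

public class WritableDemo {
    public static void main(String[] args) {
        // Value supplied at construction.
        IntWritable count = new IntWritable(42);

        // Value stored later through the setter.
        LongWritable total = new LongWritable();
        total.set(1_000_000L);

        System.out.println(count.get());  // 42       (serialized size: 4 bytes)
        System.out.println(total.get());  // 1000000  (serialized size: 8 bytes)
    }
}
```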

Array Writable Classes: Hadoop provides two types of array Writable classes, one for single-
dimensional and another for two-dimensional arrays. The elements of these arrays must
be other Writable objects, such as IntWritable or LongWritable, and not Java native data
types like int or float.
ArrayWritable, TwoDArrayWritable

Map Writable Classes: Hadoop provides the following MapWritable data types, which implement
the java.util.Map interface:

AbstractMapWritable: This is the abstract base class for the other MapWritable classes.

MapWritable: This is a general purpose map mapping Writable keys to Writable values.

SortedMapWritable: This is a specialization of the MapWritable class that also implements
the SortedMap interface.

Other Writable Classes:


NullWritable: NullWritable is a special type of Writable representing a null value. No bytes are
read or written when a data type is specified as NullWritable. So, in MapReduce, a key or a
value can be declared as a NullWritable when we do not need to use that field.
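
A hedged sketch of the typical use: a reducer that only needs to emit values can declare its output key as NullWritable, so no key bytes are written to the output. The class name and generic signature below are assumptions for illustration, not code from the notes.

```java
// Reducer that discards the key and emits only the summed values.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ValuesOnlyReducer
        extends Reducer<Text, IntWritable, NullWritable, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // NullWritable.get() returns the singleton instance; no bytes are
        // written for the key in the job output.
        context.write(NullWritable.get(), new IntWritable(sum));
    }
}
```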

ObjectWritable: This is a general-purpose generic object wrapper which can store any objects
like Java primitives, String, Enum, Writable, null, or arrays.

Text: Text can be used as the Writable equivalent of java.lang.String, and its maximum size is 2 GB.
Unlike Java's String data type, Text is mutable in Hadoop.
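
A tiny sketch (illustration only) of that mutability: the same Text object can be reused by calling set(), which is why MapReduce code commonly allocates one Text instance and refills it.

```java
// Demonstrates that Text is mutable, unlike java.lang.String.
import org.apache.hadoop.io.Text;

public class TextDemo {
    public static void main(String[] args) {
        Text t = new Text("hadoop");
        t.set("hdfs");          // same object, new contents; no new allocation needed
        System.out.println(t);  // prints "hdfs"
    }
}
```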

BytesWritable: BytesWritable is a wrapper for an array of binary data.


GenericWritable: It is similar to ObjectWritable but supports only a few types. Users need to
subclass GenericWritable and specify the types to support.

HADOOP Software Stack:

The Hadoop stack includes more than a dozen components, or subprojects, that are complex
to deploy and manage. Installation, configuration and production deployment at scale are
challenging.

Figure 4.2: HADOOP Software Stack

The main components include:

 Hadoop: Java software framework to support data-intensive distributed applications


 ZooKeeper: A highly reliable distributed coordination system
 MapReduce: A flexible parallel data processing framework for large data sets
 HDFS: Hadoop Distributed File System
 Oozie: A MapReduce job scheduler
 HBase: Key-value database
 Hive: A high-level language built on top of MapReduce for analyzing large data sets
 Pig: Enables the analysis of large data sets using Pig Latin. Pig Latin is a high-level language
compiled into MapReduce for parallel data processing.

Some good examples that display some or all of these characteristics include:

• Applications that boil lots of data down into ordered or aggregated results: sorting, word
and phrase counts, building inverted indices mapping phrases to documents, phrase
searching among large document corpuses.

• Batch analyses fast enough to satisfy the needs of operational and reporting applications,
such as web traffic statistics or product recommendation analysis.


• Iterative analysis using data mining and machine learning algorithms, such as association
rule analysis, k-means clustering, link analysis, classification, and Naïve Bayes analysis.

• Statistical analysis and reduction, such as web log analysis or data profiling.

• Behavioral analyses such as click stream analysis, discovering content-distribution networks,
and viewing behavior of video audiences.

• Transformations and enhancements, such as auto-tagging social media, ETL processing, and data
standardization.

Figure 4.3: HADOOP Stack

Deployment of Hadoop in Data Center:

Big data analytics has been a part of the data center conversation for the last few years. The
potential behind the massive data collection in today's technology market seems to be
limitless. Earlier this year, Hortonworks' founder predicted that by 2020, 75% of the Fortune
2000 companies will be running 1,000-node Hadoop clusters.
Although Hadoop might be the most popular big data solution worldwide, it is still
accompanied by deployment and management challenges such as scalability, high
availability, flexibility, and cost effectiveness.


Effective Workload Management: Hadoop's open source framework allows you to store vast
amounts of data on multiple commodity cloud servers without the need for costly purpose-
built resources. Make sure that your infrastructure personnel are involved from the beginning
and that they are aware of the feasibility of using commodity servers. Otherwise, you can end
up overbuilding your cluster and making an unnecessary investment in proprietary data
handling equipment.

Hadoop is very much about pairing computation with data, which could mean returning to
some mainframe-era roots. Effective workload management is a necessary Hadoop best
practice.

Starting Small: We've all seen the statistics on many failed IT projects due to their level of
complexity and costs. Implementing Hadoop can come with these same risks. The beauty of
Hadoop is that it allows for great scalability by just adding nodes as needed to a cluster in a
modular fashion. Although it's easy to add to a Hadoop cluster, it is not as easy to take away.
In addition, you may find that the specifications of your servers need to be altered based on
the results from and performance of your initial project, which is supported by Hadoop on an
ongoing basis. However, it's still easy to get carried away with building your first cluster.

Choosing a small project to run as a proof of concept (POC) allows development and
infrastructure staff to familiarize themselves with the inner workings of this technology,
enabling them to support other groups' big data requirements in their organization
with reduced implementation risks.

Cluster Monitoring: Although Hadoop offers some redundancy at the data and management
levels, there are still lots of moving parts that need to be monitored. Your cluster monitoring
needs to report on the whole cluster as well as on specific nodes. It also needs to be scalable
and able to automatically track an eventual increase in the number of nodes in the cluster.
The pertinent metric data of the cluster is provided by Hadoop's metrics, which are created
from a collection of runtime statistical information exposed by the Hadoop daemons.
You can use Nagios to monitor all of the nodes and services in a cluster. Nagios and Cacti can
work together to facilitate adjusting and adding checks to your Hadoop cluster, such as
reviewing each server's disk health, looking at the overall performance of groups of resources,
and allowing you to segment performance tracking by application, department, team and
data-sensitivity level.

HADOOP Infrastructure:

There are many processes that run within a Hadoop cluster; however, there are a few key
relationships that must be mentioned. NameNode and DataNode are HDFS components that
work in a master/slave mode. The NameNode is the major component that controls HDFS, whereas
the DataNodes handle block replication and read/write operations and drive the workloads for
HDFS.


Figure 4.4: HADOOP Infrastructure

JobTracker and TaskTracker are MapReduce components that also work in master/slave mode,
where the JobTracker controls the mapping and reducing tasks at individual nodes, among
other tasks. The TaskTrackers run at the node level and maintain communication with the
JobTracker for all nodes within the cluster.

The other critical component is the MapReduce computational layer. This is a complex set of
rules that Hadoop workloads depend on, where massive volumes of data are mapped and
then reduced for efficient lookups, reads and writes across all the nodes. It is the TaskTracker's
responsibility to track work at its local node, while the JobTracker oversees all the nodes in the
cluster.

HDFS Concept, Blocks, Name Node and Data Node:

Apache HDFS, or the Hadoop Distributed File System, is a block-structured file system where each
file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of
one or several machines. The Apache Hadoop HDFS architecture follows a master/slave
design, where a cluster comprises a single NameNode (master node) and all the other
nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines
that support Java. Though one can run several DataNodes on a single machine, in the
practical world these DataNodes are spread across various machines.

Name Node: NameNode is the master node in the Apache Hadoop HDFS Architecture that
maintains and manages the blocks present on the DataNodes (slave nodes). NameNode is a
very highly available server that manages the File System Namespace and controls access to
files by clients. The HDFS architecture is built in such a way that the user data never resides
on the NameNode. The data resides on DataNodes only.


Figure 4.5: HDFS Architecture

Functions of NameNode:

 It is the master daemon that maintains and manages the DataNodes (slave nodes).
 It records the metadata of all the files stored in the cluster, e.g. the location of blocks
stored, the size of the files, permissions, hierarchy, etc. There are two files associated
with the metadata:
FsImage: It contains the complete state of the file system namespace since the start of
the NameNode.
EditLogs: It contains all the recent modifications made to the file system with respect to
the most recent FsImage.
 It records each change that takes place to the file system metadata. For example, if a file
is deleted in HDFS, the NameNode will immediately record this in the EditLog.
 It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster
to ensure that the DataNodes are live.
 It keeps a record of all the blocks in HDFS and in which nodes these blocks are located.
 The NameNode is also responsible for managing the replication factor of all the blocks.
 In case of DataNode failure, the NameNode chooses new DataNodes for new replicas,
balances disk usage and manages the communication traffic to the DataNodes.

DataNode:

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode is commodity
hardware, that is, an inexpensive system which is not of high quality or high availability. The
DataNode is a block server that stores the data in the local file system (ext3 or ext4).

Functions of DataNode:

 These are slave daemons or processes which run on each slave machine.
 The actual data is stored on DataNodes.


 The DataNodes perform the low-level read and write requests from the file system's
clients.
 They send heartbeats to the NameNode periodically to report the overall health of HDFS;
by default, this frequency is set to 3 seconds.

Blocks:

Blocks are nothing but the smallest contiguous location on your hard drive where data is
stored. In general, in any file system, you store data as a collection of blocks. Similarly,
HDFS stores each file as blocks which are scattered throughout the Apache Hadoop
cluster. The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in Apache
Hadoop 1.x), which you can configure as per your requirement.

It is not necessary that each file in HDFS is stored in an exact multiple of the configured block
size (128 MB, 256 MB, etc.); the last block of a file occupies only as much space as the file needs.
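
If a different block size is needed, it can be set through the dfs.blocksize property. The snippet below is a minimal sketch; the 256 MB value is only an example, and in production the property is normally set cluster-wide in hdfs-site.xml rather than in client code.

```java
// Sketch of overriding the HDFS block size for files created by a client.
import org.apache.hadoop.conf.Configuration;

public class BlockSizeConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.blocksize accepts plain byte counts or suffixed values such as "256m".
        conf.set("dfs.blocksize", "256m");
        System.out.println("Configured block size: " + conf.get("dfs.blocksize"));
    }
}
```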

HBase Overview:

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable. HBase is a data model similar to
Google's Bigtable, designed to provide quick random access to huge amounts of structured
data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).

Figure 4.6: HBase Overview

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System. One can store the data in HDFS either directly or through HBase.
Data consumer reads/accesses the data in HDFS randomly using HBase. HBase sits on top of
the Hadoop File System and provides read and write access.
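
The sketch below (not from the notes) shows what this random read/write access looks like through the HBase Java client API; the table name "users", the column family "info", and the cell values are assumptions for illustration.

```java
// Random write and read of a single row through the HBase client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user1", column family "info", qualifier "city".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Bhopal"));
            table.put(put);

            // Random read of the same row by its key.
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```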

Features of HBase:

 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.


 It has an easy Java API for clients.
 It provides data replication across clusters.

Hive:

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook; later the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive. It is used by
different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not a relational database, nor a design for Online Transaction Processing (OLTP), nor a
language for real-time queries and row-level updates.
Features of Hive:

 It stores schema in a database and processed data in HDFS.
 It is designed for OLAP.
 It provides an SQL-type language for querying, called HiveQL or HQL (see the sketch below).
 It is familiar, fast, scalable, and extensible.
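
As a hedged illustration of the HiveQL point above (not code from the notes), a client can submit HiveQL over JDBC to HiveServer2; the host, port, database, table and query below are assumptions for illustration.

```java
// Running a HiveQL query over JDBC against HiveServer2.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Requires the hive-jdbc driver on the classpath.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```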

Figure 4.7: Hive Architecture

Cassandra:

Apache Cassandra is a highly scalable, high-performance distributed database designed to
handle large amounts of data across many commodity servers, providing high availability with
no single point of failure. It is a type of NoSQL database.

A NoSQL database (sometimes called Not Only SQL) is a database that provides a
mechanism to store and retrieve data other than the tabular relations used in relational
databases. These databases are schema-free, support easy replication, have a simple API,
are eventually consistent, and can handle huge amounts of data.


Apache Cassandra is an open source, distributed and decentralized storage system (database)
for managing very large amounts of structured data spread out across the world. It provides a
highly available service with no single point of failure.

Some important points related to Apache Cassandra:

 It is scalable, fault-tolerant, and consistent.
 It is a column-oriented database.
 Its distribution design is based on Amazon's Dynamo and its data model on Google's
Bigtable.
 Created at Facebook, it differs sharply from relational database management systems.
 Cassandra implements a Dynamo-style replication model with no single point of failure,
but adds a more powerful "column family" data model.
 Cassandra is being used by some of the biggest companies, such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra:

 Elastic Scalability: Cassandra is highly scalable; it allows you to add more hardware to
accommodate more customers and more data as per requirement.
 Always on Architecture: Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.
 Fast Linear-scale Performance: Cassandra is linearly scalable, i.e., it increases your
throughput as you increase the number of nodes in the cluster. Therefore it maintains a
quick response time.
 Flexible Data Storage: Cassandra accommodates all possible data formats including:
structured, semi-structured, and unstructured. It can dynamically accommodate changes
to your data structures according to your need.
 Easy Data Distribution: Cassandra provides the flexibility to distribute data where you
need by replicating data across multiple data centers.
 Transaction Support: Cassandra supports properties like Atomicity, Consistency,
Isolation, and Durability (ACID).
 Fast Writes: Cassandra was designed to run on cheap commodity hardware. It performs
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.

Hypertable:

Hypertable is a massively scalable database modeled after Google's Bigtable database.
Bigtable is part of a group of scalable computing technologies developed by Google.


Figure 4.8: Hypertable

Google File System (GFS): This is the lowest layer of the Google scalable computing stack. It
is a filesystem much like any other and allows for the creation of files and directories. The
primary innovation of the Google filesystem is that it is massively scalable and highly available.
It achieves high availability by replicating file data across three physical machines which
means that it can lose up to two of the machines holding replicas and the data is still available.
Hadoop provides an open source implementation of the GFS called HDFS.
MapReduce: This is a parallel computation framework designed to efficiently process data in
the GFS. It provides a way to run a large amount of data through a piece of code (map) in
parallel by pushing the code out to the machines where the data resides. It also includes a
final aggregation step (reduce) which provides a way to re-order the data based on any
arbitrary field. Hadoop provides an open source implementation of MapReduce.

Bigtable: This is Google's scalable database. It provides a way to create massive tables of
information indexed by a primary key. As of this writing, over 90% of Google's web services
are built on top of Bigtable, including Search, Google Earth, Google Analytics, Google Maps,
Gmail, Orkut, YouTube, and many more. Hypertable is a high performance, open source
implementation of Bigtable.

Sawzall: This is a runtime scripting language that sits on top of the whole stack and provides
the ability to perform statistical analysis in an easily expressible way over large data sets.
Open source projects such as Hive and Pig provide similar functionality.


Figure 4.9: Hypertable System Overview


Sqoop:

The traditional application management system, that is, the interaction of applications with
relational databases using an RDBMS, is one of the sources that generate big data.

When Big Data storages and analysers such as MapReduce, Hive, HBase, Cassandra, Pig, etc.
of the Hadoop ecosystem came into picture, they required a tool to interact with the
relational database servers for importing and exporting the Big Data residing in them. Here,
Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between
relational database servers and Hadoop's HDFS.

Sqoop − "SQL to Hadoop and Hadoop to SQL"

Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It
is used to import data from relational databases such as MySQL, Oracle to Hadoop HDFS, and
export from Hadoop file system to relational databases. It is provided by the Apache Software
Foundation.

Figure 4.10: SQOOP Working


Sqoop Import: The import tool imports individual tables from RDBMS to HDFS. Each row in a
table is treated as a record in HDFS. All records are stored as text data in text files or as binary
data in Avro and Sequence files.

Sqoop Export: The export tool exports a set of files from HDFS back to an RDBMS. The files
given as input to Sqoop contain records, which are called rows in the table. These are read and
parsed into a set of records and delimited with a user-specified delimiter.

