Module 1 - Introduction To Big Data
Slide 2
Units of Data
Data Generated by Social Media Platforms
ᗍ Billions of users
ᗍ Generate PBs of data per day
ᗍ Run millions of queries on that data every day
Slide 4
Data Generated by Entertainment/Infotainment Platforms
Slide 5
Space Agencies
Slide 6
What is Big Data?
ᗍ Huge Amount of Data (Terabytes or Petabytes)
https://simplicable.com/new/data-veracity
Slide 7
Slide 8
What is Unstructured Data?
Slide 11
Structured and Unstructured Data
Slide 13
Batch Processing
ᗍ Processing transactions in a group or batch
ᗍ The following three phases are common to any batch processing or business analytics project, irrespective of
the type of data (structured or unstructured)
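The idea of processing transactions in a group can be sketched in Python as follows. This is only an illustration: the transaction amounts, the batch size, and the function names `make_batches` and `process_batch` are hypothetical and do not come from the slides.

```python
from typing import Iterator, List


def make_batches(items: List[float], batch_size: int) -> Iterator[List[float]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_batch(batch: List[float]) -> float:
    """Process one whole batch at once; here, simply total the amounts."""
    return sum(batch)


# Hypothetical transaction amounts collected before processing starts.
transactions = [10.0, 20.0, 5.0, 15.0, 30.0]
batch_totals = [process_batch(b) for b in make_batches(transactions, batch_size=2)]
print(batch_totals)
```

The work is done per group rather than per transaction, which is the defining property of batch processing; the last batch may be smaller than the others.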
Slide 14
Data Collection
Unstructured Data
Sqoop
Structured Data
Slide 15
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data into the Hadoop Distributed
File System (HDFS). It has a simple and flexible architecture based on streaming data
flows; and is robust and fault tolerant with tunable reliability mechanisms for failover
and recovery.
YARN coordinates data ingest from Apache Flume and other services that deliver raw
data into an Enterprise Hadoop cluster.
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage.
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is
used to import data from relational databases such as MySQL and Oracle into HDFS, and to
export data from HDFS back to relational databases.
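To make the import direction concrete, here is a small Python sketch of what "table rows become delimited text" amounts to. It does not use Sqoop: the standard-library `sqlite3` module stands in for MySQL or Oracle, an in-memory string stands in for an HDFS file, and the `student` table and the function name `import_table` are hypothetical.

```python
import csv
import io
import sqlite3


def import_table(conn: sqlite3.Connection, table: str) -> str:
    """Read every row of `table` and return it as comma-delimited text."""
    # The table name is interpolated only because it is a fixed demo value.
    cursor = conn.execute(f"SELECT * FROM {table}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cursor)
    return buffer.getvalue()


# Stand-in relational database with a hypothetical student table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")],
)
delimited = import_table(conn, "student")
conn.close()
print(delimited)
```

An export is the reverse: delimited records are parsed and inserted back into a relational table.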
MapReduce
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for
distributed processing of large data sets on computing clusters. It is a
sub-project of the Apache Hadoop project. Apache Hadoop is an
open-source framework that lets you store and process big data in a
distributed environment across clusters of computers using simple
programming models. MapReduce is the core component for data
processing in the Hadoop framework.
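The map and reduce steps can be sketched on a single machine with a word count, the usual introductory example. This Python sketch does not use Hadoop; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could be mapped on a different node.
lines = ["big data is big", "hadoop processes big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines.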
Pig
Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data
workers to write complex data transformations without knowing Java. Pig's simple SQL-like
scripting language is called Pig Latin, and it appeals to developers already familiar with scripting
languages and SQL.
Sample_script.pig
-- Load the comma-separated student file from HDFS with a declared schema, then print it
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
    AS (id:int, name:chararray, city:chararray);
DUMP student;
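For readers who know Python better than Pig Latin, the script above corresponds roughly to the sketch below: split each line on commas, apply the declared schema, and print the records. The sample file contents and the names `Student` and `load_students` are hypothetical, and the sketch reads from a string rather than from HDFS.

```python
from typing import List, NamedTuple


class Student(NamedTuple):
    """Mirrors the schema (id:int, name:chararray, city:chararray)."""
    id: int
    name: str
    city: str


def load_students(text: str) -> List[Student]:
    """Split each non-empty line on commas and apply the declared schema."""
    students = []
    for line in text.splitlines():
        if not line.strip():
            continue
        raw_id, name, city = line.split(",")
        students.append(Student(int(raw_id), name, city))
    return students


# Hypothetical contents of student.txt.
sample = "1,Asha,Pune\n2,Ravi,Delhi\n"
for student in load_students(sample):
    print(student)
```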
Data Processing: Business Analytics / Batch Processing System (Pig)
Data Presentation: Output
Slide 21
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
Slide 22
Hadoop Key Characteristics
Reliable
Flexible
Slide 23
Hadoop Ecosystem
Apache Oozie (Workflow)
Hive (DW System), Pig Latin (Data Analysis), HBase, Other YARN Frameworks (MPI, GIRAPH)
MapReduce Framework
YARN (Cluster Resource Management)
HDFS (Hadoop Distributed File System)
Flume and Sqoop (Import or Export)
Unstructured or Semi-Structured Data; Structured Data
Slide 24
Hadoop versions with history
https://archive.apache.org/dist/hadoop/core/
Slide 25
Hadoop 2.x Core Components
Slide 26
Hadoop 1.x Vs Hadoop 2.x
Slide 27
Hadoop 3.x Core Components
A major improvement in Hadoop 3.0 relates to the way YARN
works and what it can support. Hadoop's resource manager, YARN,
was introduced in Hadoop 2.0 to make Hadoop clusters run
efficiently. In Hadoop 3.0, YARN gains multiple
enhancements in the following areas:
• Support for long-running services, driven by the need to consolidate
infrastructure.
• Better resource isolation for disk and network, resource
utilization, user experience, Docker support, and elasticity.
• YARN Timeline Service rearchitecture to ATS v2.
Slide 28
Difference between 2.x and 3.x
Slide 29
Difference between 2.x and 3.x
Slide 30
Hadoop 2.x Core Components
Slide 31
Components
YARN - Apache YARN (“Yet Another Resource Negotiator”) is the
resource management layer of Hadoop.
YARN was introduced in Hadoop 2.x. It allows
different data processing engines, such as graph processing,
interactive processing, stream processing, and batch
processing, to run and process data stored in HDFS (Hadoop
Distributed File System).
Apart from resource management, YARN also performs job
scheduling.
YARN extends the power of Hadoop to other evolving
technologies.
Slide 32
Components
HDFS Cluster - A cluster is a collection of nodes. A node is a
process running on a virtual or physical machine or in a
container.
When you run Hadoop in local (standalone) mode, it writes data to the local
file system instead of HDFS (Hadoop Distributed File System).
Slide 33
Components
Node - A node is a process running on a virtual or physical
machine or in a container. We say process because the machine
may be running other programs besides Hadoop.
Slide 34
Components
Resource Manager - The ResourceManager is the core
component of YARN.
Slide 35
Components
NameNode - The NameNode is the centerpiece of
HDFS and is also known as the Master.
The NameNode stores only the metadata of HDFS: the directory
tree of all files in the file system. It also tracks the files across
the cluster.
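The split between metadata and data can be sketched as a toy model in Python. This is not the HDFS API: the class `NameNodeSketch`, the block IDs, and the DataNode names are all hypothetical, chosen only to show that the structure holds paths and block locations but never file contents.

```python
from typing import Dict, List


class NameNodeSketch:
    """Toy model: holds only metadata, never file contents."""

    def __init__(self) -> None:
        self.file_blocks: Dict[str, List[str]] = {}      # file path -> block IDs
        self.block_locations: Dict[str, List[str]] = {}  # block ID -> DataNode names

    def add_file(self, path: str, blocks: Dict[str, List[str]]) -> None:
        """Record which blocks make up a file and where each block lives."""
        self.file_blocks[path] = list(blocks)
        self.block_locations.update(blocks)

    def locate(self, path: str) -> List[List[str]]:
        """Return, per block of the file, the DataNodes that hold it."""
        return [self.block_locations[b] for b in self.file_blocks[path]]


namenode = NameNodeSketch()
namenode.add_file(
    "/pig_data/student.txt",
    {"blk_1": ["datanode1", "datanode2"], "blk_2": ["datanode2", "datanode3"]},
)
print(namenode.locate("/pig_data/student.txt"))
```

A client in this model would ask the NameNode where the blocks are and then read the bytes from the listed DataNodes.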
Slide 36
Components
NodeManager - The NodeManager is a more generic, efficient, and
flexible version of the TaskTracker (of the Hadoop 1 architecture).
In contrast to the fixed number of slots for map and reduce tasks
in MRv1, the NodeManager of MRv2 has a number
of dynamically created resource containers.
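The difference from fixed slots can be illustrated with a toy Python model in which containers are created on demand and sized per request. It is not YARN code: the class `NodeManagerSketch`, the memory figures, and the choice of memory as the only resource are assumptions made for illustration.

```python
from typing import List


class NodeManagerSketch:
    """Toy model of a node that hands out containers sized per request."""

    def __init__(self, total_memory_mb: int) -> None:
        self.free_memory_mb = total_memory_mb
        self.containers: List[int] = []  # memory size of each running container

    def allocate(self, memory_mb: int) -> bool:
        """Create a container if enough memory is free; report success."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        self.containers.append(memory_mb)
        return True


# Hypothetical node with 8 GB available for containers.
node = NodeManagerSketch(total_memory_mb=8192)
results = [node.allocate(size) for size in (1024, 4096, 2048, 2048)]
print(results, node.free_memory_mb)
```

With fixed slots, a small task and a large task would each occupy one identical slot; here the number and size of containers depend on what is requested and what is still free.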
Slide 37
Components
DataNode - The DataNode is responsible for storing the actual
data in HDFS.
The DataNode is also known as the Slave. The NameNode and
DataNode are in constant communication.
Slide 38
Secondary NameNode
ᗍ In HDFS 1.0, not a hot standby for the NameNode
ᗍ By default connects to the NameNode every hour*
ᗍ Housekeeping, backup of NameNode metadata
ᗍ Saved metadata is used to bring up the Secondary NameNode
[Diagram: NameNode sends metadata to the Secondary NameNode. Caption: "It'll take metadata every hour and will make it secure"]
Slide 39
Thank you