Module 1 - Introduction To Big Data

This document provides an introduction to big data and Hadoop. It discusses what big data is, the challenges it poses, and the different types of data: structured, unstructured, and semi-structured. It also discusses Hadoop capabilities such as distributed storage and processing. Key components of Hadoop include HDFS for storage, YARN for resource management, MapReduce for distributed computation, Pig and Hive for data analysis, and Sqoop and Flume for data transfer. Hadoop has evolved from version 1, in which MapReduce handled both data processing and cluster resource management, to version 2, which separates these concerns between MapReduce and YARN, and to version 3, which further improves YARN.


Big Data and Hadoop
Introduction to Big Data and Hadoop
Session Objectives
This session will help you to:

• Understand what Big Data is
• List the challenges associated with Big Data
• Understand the difference between real-time and batch processing
• Describe Hadoop capabilities
• Explore the Hadoop ecosystem
Units of Data
Data volumes are measured in escalating units: KB, MB, GB, TB, PB, EB, ZB, each roughly a thousand times larger than the previous one.

Data Generated by Social Media Platforms
• Billions of users
• Generate petabytes (PBs) of data per day
• Fire millions of queries on that data every day
Data Generated by Entertainment/Infotainment Platforms

Space Agencies
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of data comprising climate observations.
What is Big Data?
• Huge amounts of data (terabytes or petabytes)
• Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications
• The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization

Data veracity is the degree to which data is accurate, precise, and trusted. Data is often viewed as certain and reliable. The reality of problem spaces, data sets, and operational environments is that data is often uncertain, imprecise, and difficult to trust.

https://simplicable.com/new/data-veracity
What is Unstructured Data?

Unstructured data is essentially everything that is not structured data. It has internal structure but is not organized via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational database such as a NoSQL database.
What is Unstructured Data?
Typical human-generated unstructured data includes:
• Text files: Word processing documents, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured; however, its message field is unstructured and traditional analytics tools cannot parse it.
• Social media: Data from Facebook, Twitter, LinkedIn.
• Websites: YouTube, Instagram, photo-sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3s, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
IBM's Definition of Big Data

[Figure: IBM characterizes Big Data along two axes: Variety (photo, web, video, audio) and Volume (MB, GB, TB, PB).]
Structured and Unstructured Data

• 2,500 exabytes of new information in 2012, with the internet as the primary driver
• The digital universe grew by 62% last year to 800,000 petabytes, and will grow to 1.2 zettabytes this year
What is Hadoop?
Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management framework with scale-out storage and distributed processing.
Batch Processing
• Processing transactions in a group or batch
• The following three phases are common to any batch processing or business analytics project, irrespective of the type of data (structured or unstructured):

Data Collection → Data Preparation → Data Presentation
Data Collection

[Diagram: Unstructured data from real-time systems is collected through Flume, and structured data is imported through Sqoop, into the business analytics / batch processing system.]
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault-tolerant, with tunable reliability mechanisms for failover and recovery.

YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.

Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage, as sketched in the example below.
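
As an illustration only, a minimal Flume agent configuration might look like the following; the agent name, log path, and HDFS URI are hypothetical:

# Hypothetical agent "a1" with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a local application log (path is an assumption)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write the buffered events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events
a1.sinks.k1.hdfs.fileType = DataStream

Such an agent would typically be started with: flume-ng agent --conf ./conf --conf-file flume-hdfs.conf --name a1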
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
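
For illustration, a typical import and export might look like the following sketch; the connection string, credentials, table, and directory names are all hypothetical:

# Import a relational table into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser --password-file /user/hadoop/.dbpass \
  --table customers \
  --target-dir /user/hadoop/customers

# Export processed results from HDFS back to a relational table
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser --password-file /user/hadoop/.dbpass \
  --table customer_summary \
  --export-dir /user/hadoop/customer_summary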
MapReduce
Hadoop MapReduce (Hadoop Map/Reduce) is a software framework for the distributed processing of large data sets on computing clusters. It is a sub-project of the Apache Hadoop project. Apache Hadoop is an open-source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models. MapReduce is the core component for data processing in the Hadoop framework.
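
To make the programming model concrete, here is a minimal word-count sketch in Java, closely following the standard Apache Hadoop tutorial example; input and output paths are supplied on the command line:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every word in each input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, one);
        }
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compiled into a JAR, it could be submitted with something like: hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output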
Pig
Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

Sample_script.pig
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',')
          AS (id:int, name:chararray, city:chararray);
DUMP student;
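
Continuing with the same hypothetical student relation, a short transformation might group the students by city and count them; this is an illustrative sketch, not part of the original script:

grouped = GROUP student BY city;
city_counts = FOREACH grouped GENERATE group AS city, COUNT(student) AS num_students;
DUMP city_counts;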
Data Presentation

[Diagram: Pig performs the data processing on the business analytics / batch processing system and produces the output for presentation.]
Hadoop Key Characteristics

• Reliable
• Economical
• Flexible
• Scalable
Hadoop Ecosystem

[Diagram, layered from top to bottom:
• Apache Oozie (workflow)
• Hive (DW system), Pig Latin (data analysis), other YARN frameworks (MPI, Giraph), HBase
• MapReduce framework
• YARN (cluster resource management)
• HDFS (Hadoop Distributed File System)
• Flume and Sqoop import or export unstructured/semi-structured and structured data into and out of the stack]
Hadoop Versions and History
The release history of Hadoop is archived at:
https://archive.apache.org/dist/hadoop/core/
Hadoop 2.x Core Components

• The entire infrastructure is divided into two parts: storage and processing
• The Hadoop Distributed File System (HDFS) provides the storage mechanism; Yet Another Resource Negotiator (YARN) provides the processing part
• There are 5 Hadoop daemons in total, 3 for HDFS and 2 for YARN, working in a master/slave mode
• NameNode, Secondary NameNode (masters) and DataNodes (slaves) for HDFS; ResourceManager (master) and NodeManagers (slaves) for YARN (basic HDFS shell usage is sketched after this list)
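
To make the storage side concrete, here is a sketch of basic HDFS shell usage; the file and directory names are hypothetical:

hdfs dfs -mkdir -p /user/hadoop/input        # create a directory in HDFS
hdfs dfs -put access.log /user/hadoop/input  # copy a local file into HDFS
hdfs dfs -ls /user/hadoop/input              # list the directory contents
hdfs dfs -cat /user/hadoop/input/access.log  # print the file to the console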
Hadoop 1.x vs Hadoop 2.x

[Comparison diagram of the Hadoop 1.x and Hadoop 2.x architectures.]
Hadoop 3.x Core Components
A major improvement in Hadoop 3.0 relates to the way YARN works and what it can support. Hadoop's resource manager, YARN, was introduced in Hadoop 2.0 to make Hadoop clusters run efficiently. In Hadoop 3.0, YARN comes with multiple enhancements in the following areas:
• Support for long-running services, with the need to consolidate infrastructure
• Better resource isolation for disk and network, resource utilization, user experience, Docker opportunities, and elasticity
• Re-architecture of the YARN Timeline Service to ATS v2
Difference between 2.x and 3.x

[Comparison tables of Hadoop 2.x and Hadoop 3.x, continued across two slides.]
Hadoop 2.x Core Components

[Diagram: In the YARN layer, a ResourceManager (master) coordinates NodeManagers running on each node. In the HDFS layer, a NameNode on the admin node coordinates DataNodes across the cluster.]
Components
YARN - Apache YARN ("Yet Another Resource Negotiator") is the resource management layer of Hadoop. YARN was introduced in Hadoop 2.x. It allows different data processing engines, such as graph processing, interactive processing, and stream processing, as well as batch processing, to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN also does job scheduling. YARN extends the power of Hadoop to other evolving technologies.
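
As a rough illustration, the YARN daemons are typically wired together through yarn-site.xml; a minimal sketch (the master hostname is hypothetical) might look like:

<configuration>
  <!-- Where the NodeManagers find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-node</value>
  </property>
  <!-- Auxiliary shuffle service needed to run MapReduce on YARN -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>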
Components
HDFS Cluster - A cluster is a collection of nodes. A node is a process running on a virtual or physical machine or in a container. When you run Hadoop in local mode, it writes data to the local file system instead of HDFS (Hadoop Distributed File System).
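
Which file system is used by default is controlled by the fs.defaultFS property in core-site.xml; a minimal sketch for a single-node (pseudo-distributed) setup, with a hypothetical port, might be:

<configuration>
  <!-- The default is file:///, i.e. the local file system (local mode).
       Pointing this at an HDFS URI makes HDFS the default file system. -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>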
Components
Node - A node is a process running on a virtual or physical machine or in a container. We say process because the machine could be running other programs besides Hadoop.
Components
Resource Manager - The ResourceManager is the core component of YARN; it arbitrates cluster resources among all the applications in the system.
Components
Name Node - The NameNode is the centerpiece of HDFS and is also known as the master. The NameNode stores only the metadata of HDFS (the directory tree of all files in the file system) and tracks the files across the cluster.
Components
Node Manager - The NodeManager is a more generic and efficient version of the TaskTracker of the Hadoop 1 architecture, and it is more flexible than the TaskTracker. In contrast to the fixed number of slots for map and reduce tasks in MRv1, the NodeManager of MRv2 manages a number of dynamically created resource containers.
Components
Data Node - The DataNode is responsible for storing the actual data in HDFS. The DataNode is also known as the slave; the NameNode and the DataNodes are in constant communication.
Secondary NameNode

• In HDFS 1.0, not a hot standby for the NameNode
• By default, connects to the NameNode every hour
• Performs housekeeping and keeps a backup of the NameNode metadata
• The saved metadata is used to bring up the NameNode if it fails

[Diagram: The Secondary NameNode takes the metadata from the NameNode every hour and keeps it safe.]
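
The hourly default mentioned above corresponds to the checkpoint interval, which can be tuned in hdfs-site.xml; a minimal sketch:

<configuration>
  <!-- How often, in seconds, the Secondary NameNode checkpoints the
       NameNode metadata; 3600 seconds (one hour) is the default. -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
</configuration>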
Thank you
