Hadoop BigData Testing Overview

Big Data refers to data that exceeds the storage and processing capabilities of traditional databases, characterized by its volume, velocity, and variety. Hadoop is a Java-based framework designed for processing large datasets in a distributed computing environment, utilizing components such as HDFS for storage and MapReduce for data processing. The document outlines the architecture, challenges, and operational modes of Hadoop, as well as examples of data storage and processing using MapReduce.

Big Data

• Big Data is data that is beyond the storage and processing capacity of conventional database management systems.
• A huge amount of data is generated daily, measured in petabytes, and the data generation rate is increasing rapidly.
Sources of Big Data
Big Data Characteristics
• Volume (the size of the data)
• Velocity (the speed at which data is generated and must be processed)
• Variety (the types of data)

• Data that exhibits these three characteristics is called Big Data.


Big Data Classification
• Structured data:
• Data that has a proper structure and can easily be stored in tabular form in a relational database such as MySQL or Oracle is known as structured data. Example: employee records.
• Semi-structured data:
• Data that has some structure but cannot be stored in tabular form in a relational database is known as semi-structured data. Examples: XML data, email messages.
• Unstructured data:
• Data that has no structure and cannot be stored in tabular form in a relational database is known as unstructured data. Examples: video files, audio files, text files.
Challenges of Big Data
• Capturing data
• Storage
• Searching
• Sharing
• Transfer
• Analysis
• Presentation
Google’s Solution
• Google solved this problem using an algorithm called MapReduce.
• This algorithm divides a task into small parts, assigns those parts to many computers connected over the network, and then collects their results to form the final result dataset.
History of Hadoop
• Google was facing huge growth in data in the 1990s and had to work out how to manage it.
• It came up with a solution roughly 13 years later:
• 2003 – Google File System (GFS), for storage
• 2004 – MapReduce, for processing
• Yahoo! later built the open-source equivalents:
• 2007 – HDFS (Hadoop Distributed File System)
• 2007-08 – MapReduce
• Doug Cutting, who was working at Yahoo! at the time, named the project after his son's toy elephant.
What is Hadoop?
• Hadoop is a free, Java-based programming framework that supports
the processing of large data sets in a distributed computing
environment.
• It is part of the Apache project sponsored by the Apache Software
Foundation.
• Hadoop uses commodity (cheap) hardware and the cluster concept.
• It is recommended for huge datasets, not for small ones.
Hadoop Architecture
• Hadoop Components:
• MapReduce
• Processes data that is stored on HDFS.
• HDFS
• Stores huge datasets.
HDFS (Hadoop Distributed File System)
• HDFS is a file system specially designed for storing huge datasets on a cluster of commodity hardware with a streaming access pattern.
• Cluster: a group of systems connected over a LAN.
• Commodity hardware: cheap hardware.
• Streaming access pattern: you can write a file once and read it any number of times, but you cannot change its content once it is stored in HDFS.
• WORM (Write Once, Read Many)
Why is HDFS a specially designed file system?
• Normal file system: default block size is 4 KB.
• HDFS: default block size is 128 MB.
Difference between the Unix file system and HDFS
• In the Unix file system the default block size is 4 KB. Suppose your file is 6 KB; then you require 2 blocks of 4 KB each. In total 8 KB is used, but you actually need only 6 KB, so the extra 2 KB is wasted.
• In HDFS the default block size is 128 MB. Suppose your file is 200 MB; then HDFS requires 2 blocks (one of 128 MB and one of 72 MB). The unused 56 MB in the last block is not occupied; the extra space is released.
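As a rough illustration of the arithmetic above, here is a tiny plain-Java sketch (not Hadoop code; the file and block sizes simply mirror the example) that computes how many blocks a 200 MB file needs and how much of the last block is actually occupied:

public class BlockMath {
  public static void main(String[] args) {
    long fileSizeMb = 200;   // file size from the example above
    long blockSizeMb = 128;  // HDFS default block size

    long fullBlocks = fileSizeMb / blockSizeMb;                 // 1 full block of 128 MB
    long lastBlockMb = fileSizeMb % blockSizeMb;                // 72 MB in the last block
    long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 2 blocks in total

    System.out.println("Blocks needed: " + totalBlocks);
    System.out.println("Last block occupies only " + lastBlockMb + " MB of the 128 MB block size");
  }
}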
HDFS Services (Daemons/Nodes)
• Hadoop has 5 services:
• Master services:
1. NameNode
2. Secondary NameNode
3. JobTracker (Resource Manager)
• Slave services:
4. DataNode
5. TaskTracker (Node Manager)
• The first 3 are master services and the last 2 are slave services.
• Master services can communicate with each other; the same is true for slave services.
• In master-slave communication, a NameNode communicates only with DataNodes and a JobTracker communicates only with TaskTrackers.
• If the NameNode fails, the Secondary NameNode is available. However, failure of the JobTracker is a single point of failure in the Hadoop architecture.
General Use case
One day a farmer wanted to store his rice packets in a godown (warehouse). He went to the godown and contacted the manager. The manager asked what types of rice packets he had. The farmer said he had 50 packets of Rs 30, 50 of Rs 40, and 100 of Rs 50. The manager noted this on a bill paper and asked his assistant to store the Rs 30 packets in room 1, the Rs 40 packets in room 3, and the Rs 50 packets in room 7. The manager keeping track of which packets went to which room is analogous to the NameNode keeping metadata, while the rooms are like DataNodes holding the actual data.
Storing/Writing Data in HDFS
• The client wants to store a 200 MB file, file.txt, in HDFS.
• file.txt is divided into sub-files of 64 MB each; these chunks are called input splits.
• The client sends the request to the NameNode. The NameNode reads its metadata and returns a list of available DataNodes to the client.
• The client sends each input split to the DataNodes.
• The first split, a.txt, is stored on DataNode 1.
• HDFS also maintains replicas of the data; the default replication factor is 3.
• So a.txt is stored on DataNodes 5 and 9 as well.
• An acknowledgement is sent from DataNode 9 to 5, from 5 to 1, and from 1 to the client.
• The response is then returned to the client.
• For every storage operation, the DataNode sends a block report and a heartbeat to the NameNode.
• DataNodes send heartbeats to the NameNode at short, regular intervals.
• If any DataNode fails, the NameNode copies its data to another DataNode.
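To see this flow from the client side, below is a minimal sketch of writing a file to HDFS with the standard org.apache.hadoop.fs.FileSystem Java API. The NameNode address and path are illustrative assumptions; the key point is that the client obtains metadata through the NameNode, while the bytes themselves are streamed to DataNodes, which replicate each block (3 copies by default).

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

    try (FileSystem fs = FileSystem.get(conf);
         FSDataOutputStream out = fs.create(new Path("/user/demo/file.txt"))) {
      // What we write here is split into blocks by HDFS, and each block is
      // replicated across DataNodes (replication factor 3 by default).
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }
  }
}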
MapReduce (Retrieving Data)
• To process the stored data, we write a small program (say 10 KB).
• Running this program on a split of the data is called a map.
• The client submits the program/job to the JobTracker to run on a.txt.
• The JobTracker asks the NameNode where the data is.
• The NameNode replies to the JobTracker with the metadata.
• The JobTracker now knows on which DataNodes the blocks of the file are stored.
• For example, a.txt is stored on DataNodes 1, 5 and 9.
• Using this metadata, the JobTracker contacts the TaskTracker closest to the data (on DataNode 1).
• The JobTracker assigns the task to that TaskTracker.
• The TaskTracker runs the job on a.txt; this is called a map.
• Similarly, TaskTrackers run the program on b.txt, c.txt and d.txt; each of these runs is also a map.
• The files a.txt, b.txt, c.txt and d.txt are the input splits.
• Number of input splits = number of mappers.
• If a TaskTracker is not able to do the job, it informs the JobTracker.
• The JobTracker then assigns the task to the nearest other TaskTracker.
• Every 3 seconds, each TaskTracker sends a heartbeat to the JobTracker.
• If the JobTracker receives no heartbeat for about 10 seconds, it assumes the TaskTracker is dead or slow.
• A TaskTracker may be slow if it is running many jobs (for example, 100).
• The JobTracker knows how many jobs each TaskTracker is running.
• If a TaskTracker is busy, the JobTracker chooses another TaskTracker.
• After the program runs, a.txt returns 10 KB of output, b.txt 10 KB, c.txt 10 KB and d.txt 4 KB.
• Once the map outputs are available, the Reducer combines them all into a single output file.
• Number of reducers = number of output files.
• The output file may be saved on the local DataNode or on some other node.
• The DataNode where the output file is saved sends a heartbeat to the NameNode.
• The NameNode saves this metadata.
• The client contacts the NameNode to get the location of the output file.
MapReduce Examples
• Steps in MapReduce:
1. The main job is split into sub-jobs.
2. The sub-jobs are mapped to different CPUs/processors.
3. The outputs from the different processors (mappers) are collected.
4. The outputs are reduced to produce the final result.
MapReduce Example
• Task: compute the sum of 1000 numbers.
• We have 10 people available.
1. Input splitting – divide the 1000 numbers among the 10 people.
2. Map – each person adds up their 100 numbers.
3. Reduce – once all 10 people have finished, the 10 partial sums are collected by a single person, who adds them together to get the final output.
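The same idea can be sketched in a few lines of plain Java (a local simulation only, with no Hadoop involved; the chunking into 10 splits of 100 numbers mirrors the steps above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

public class SumExample {
  public static void main(String[] args) {
    int[] numbers = IntStream.rangeClosed(1, 1000).toArray();

    // Input splitting: divide the 1000 numbers into 10 chunks of 100.
    List<int[]> splits = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      splits.add(Arrays.copyOfRange(numbers, i * 100, (i + 1) * 100));
    }

    // Map: each "person" sums its own chunk of 100 numbers.
    List<Integer> partialSums = new ArrayList<>();
    for (int[] split : splits) {
      partialSums.add(IntStream.of(split).sum());
    }

    // Reduce: combine the 10 partial sums into the final result.
    int total = partialSums.stream().mapToInt(Integer::intValue).sum();
    System.out.println("Total = " + total); // prints 500500
  }
}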
Word Count example in Map-Reduce format
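Below is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce), following the well-known example from the Hadoop documentation; input and output paths are supplied as command-line arguments. The mapper emits (word, 1) pairs and the reducer sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each input line, emit (word, 1) for every word in the line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all the 1s emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner does a local reduce on each mapper's output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}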
Hadoop Distributions
Hadoop Ecosystem
• The Hadoop ecosystem contains different sub-projects (tools), such as Sqoop, Pig, and Hive, that complement the core Hadoop modules.
Hadoop Running Modes
• Standalone Mode
• Default mode of Hadoop
• HDFS is not utilized in this mode.
• Local file system is used for input and output
• Used for debugging purposes.
• No custom configuration is required in the three Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
• Standalone mode is much faster than pseudo-distributed mode.
• Pseudo Distributed Mode (Single Node Cluster)
• Configuration is required in the three files mentioned above for this mode (a minimal sample is sketched after this list).
• The replication factor is 1 for HDFS.
• One node acts as Master Node / DataNode / JobTracker / TaskTracker.
• Used to test real code against HDFS.
• A pseudo-distributed cluster is a cluster where all daemons run on a single node.
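As a minimal sketch of what that configuration typically looks like for pseudo-distributed mode (the property names are the standard Hadoop ones; the host, port and single-node layout are illustrative assumptions):

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>  <!-- single local NameNode -->
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>  <!-- only one DataNode, so replication factor 1 -->
  </property>
</configuration>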
• Fully distributed mode (or multiple node cluster)
• This is the production mode.
• Data is distributed and processed across many nodes.
• Different nodes are used as Master Node / DataNode / JobTracker / TaskTracker.
Accessing HDFS
• We can access HDFS in 3 different ways.
1. CLI mode
2. GUI mode using thick clients
3. Browser
Cloudera Quick VM Download
• http://www.cloudera.com/downloads/quickstart_vms/5-8.html
Local File System Commands on Linux
HDFS Commands on Linux
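A brief sketch of commonly used commands for both cases (paths and file names are illustrative):

Local file system (Linux shell):
  pwd                      # show the current directory
  ls -l                    # list files in the current directory
  mkdir data               # create a directory
  cat file.txt             # print the contents of a file

HDFS (via the hdfs dfs client):
  hdfs dfs -ls /                              # list the HDFS root directory
  hdfs dfs -mkdir -p /user/demo               # create a directory in HDFS
  hdfs dfs -put file.txt /user/demo/          # copy a local file into HDFS
  hdfs dfs -cat /user/demo/file.txt           # print a file stored in HDFS
  hdfs dfs -get /user/demo/file.txt copy.txt  # copy a file from HDFS to local disk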
