Ch6 Architectural Design v1

The document provides an overview of Hadoop, an open-source platform for distributed storage and processing of large datasets. It discusses the Hadoop ecosystem, including components like HDFS, YARN, and MapReduce, as well as the differences between Hadoop and SQL databases. The document emphasizes Hadoop's flexibility and efficiency in handling big data through a schema-on-read approach and parallel task distribution.

Uploaded by

7knzpnkvbg


Big Data Analytics

14014305-3

Slides are reused from ISE:4172 Big Data Analytics by Stephen Baek, University of Iowa, with appreciation for their educational contribution.
Overview of Hadoop
What is Hadoop and why is it useful?
Last Time...

◉ The 3V’s of Big Data

◉ The first V (Volume) is especially problematic.
◉ What can we do if we have a really large set of data?

Distributed storage system that distributes the data to multiple machines/computers!
Local vs. Distributed

◉ Local Machine:
○ Uses its own computational resources
◉ Distributed System:
○ Utilizes resources across a network
◉ Vertical Scaling:
○ Adding resources to a single machine; expensive
◉ Horizontal Scaling:
○ Adding computers via the network; cost-effective
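
As a rough illustration of horizontal scaling, the sketch below spreads records over a set of machines by hashing each record's key. The node names, the record keys, and the `assign_node` and `distribute` helpers are hypothetical and not part of Hadoop; HDFS itself splits files into blocks rather than hashing keys, so this only shows the general idea of sharing data across more machines.

```python
import zlib
from collections import defaultdict


def assign_node(key: str, nodes: list[str]) -> str:
    """Pick a node for a key with a stable hash (CRC32), so reruns agree."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Group keys by the node that would store them."""
    placement: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        placement[assign_node(key, nodes)].append(key)
    return dict(placement)


# Hypothetical record keys and node names, for illustration only.
records = [f"record-{i}" for i in range(1000)]
two_nodes = distribute(records, ["node-1", "node-2"])
four_nodes = distribute(records, ["node-1", "node-2", "node-3", "node-4"])

# Horizontal scaling: adding machines tends to shrink each machine's share.
largest_share_2 = max(len(v) for v in two_nodes.values())
largest_share_4 = max(len(v) for v in four_nodes.values())
```

Comparing `largest_share_2` with `largest_share_4` shows how the busiest machine's load changes when two more machines join the network.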
What is Hadoop

◉ An open-source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Hadoop History
◉ Google File System (GFS) and MapReduce papers in 2003, 2004
◉ Yahoo! project Nutch, an open source search engine.
◉ Doug Cutting and Tom White in 2006

[Image source: Doug Cutting, Twitter]
Hadoop Ecosystem

• Hadoop Distributed File System (HDFS)
  • Distributes large datasets across multiple servers
  • Ensures fault tolerance through data replication
• Yet Another Resource Negotiator (YARN)
  • Manages computing resources within Hadoop clusters
  • Allocates resources efficiently for executing tasks
• MapReduce
  • Computational model for distributed processing in Hadoop
  • Utilizes mappers and reducers to process data in parallel across the cluster
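
The mapper and reducer roles can be sketched with a word count in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not the Hadoop API; the function names and the sample lines are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def mapper(line: str):
    """Map step: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce step: combine the values for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())


# Made-up input; in a cluster each line could go to a different mapper.
sample = ["big data big cluster", "data node", "big node"]
counts = word_count(sample)
```

Because each call to `mapper` depends only on its own line, and each call to `reducer` only on its own key, both steps could run on separate machines at the same time.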
Hadoop Ecosystem

• Pig and Hive
  • High-level scripting languages for Hadoop and MapReduce
  • Simplify data processing tasks for users not proficient in lower-level languages
• Apache Ambari
  • Administrative interface for overseeing Hadoop clusters
  • Provides management and monitoring functionalities for efficient cluster operation
Hadoop Ecosystem

• Apache Mesos
  • Manages computer clusters similarly to YARN
  • Handles task scheduling and resource management within the cluster
• Apache Spark
  • Fast and widely adopted technology within the ecosystem
  • Offers significant performance improvements over MapReduce
  • Supports multiple programming languages, including Scala, Java, and Python
Motivation: Project Management

[Figure sequence: Tricia, the project manager, coordinates four team members: John, Sarah, Sanjay, and Bob.]

◉ Tricia assigns one project to each member and records the assignments as “metadata”:
○ John: A
○ Sarah: B
○ Sanjay: C
○ Bob: D
◉ Bob runs into trouble (😖), which puts Project D at risk.
◉ John picks up Project D in addition to Project A, shown as “Project A (D)”.
◉ Every member then also holds a second project, shown in parentheses, and the metadata records both:
○ John: A (D)
○ Sarah: B (C)
○ Sanjay: C (A)
○ Bob: D (B)
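Hadoop Master/Slave Architecture

[Figure: one master node surrounded by four slave nodes. Each slave node holds its own item plus, in parentheses, a copy of another node's item; the labels “Project A (D)” etc. are carried over from the project-management analogy.]

“Metadata” (kept by the master node)
● Slave 1: A (D)
● Slave 2: B (C)
● Slave 3: C (A)
● Slave 4: D (B)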
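
The layout on this slide can be sketched as a small in-memory model: the master keeps only the metadata, and a read falls back to the second copy when a slave is down. The `SlaveNode` and `MasterNode` classes, the node names, and the block contents are hypothetical; this is not how HDFS is implemented or called, only a minimal picture of the bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class SlaveNode:
    name: str
    blocks: dict = field(default_factory=dict)  # block id -> contents
    alive: bool = True


class MasterNode:
    """Keeps only metadata: which slaves hold a copy of each block."""

    def __init__(self, slaves):
        self.slaves = {s.name: s for s in slaves}
        self.metadata = {}  # block id -> list of slave names

    def store(self, block_id, contents, primary, backup):
        # Write the same contents to both slaves and record the locations.
        for name in (primary, backup):
            self.slaves[name].blocks[block_id] = contents
        self.metadata[block_id] = [primary, backup]

    def read(self, block_id):
        # Try each recorded location until a live slave answers.
        for name in self.metadata[block_id]:
            slave = self.slaves[name]
            if slave.alive:
                return slave.name, slave.blocks[block_id]
        raise RuntimeError(f"no live replica for block {block_id}")


slaves = [SlaveNode(f"slave-{i}") for i in range(1, 5)]
master = MasterNode(slaves)
# Same layout as the slide: Slave 1: A (D), Slave 2: B (C),
# Slave 3: C (A), Slave 4: D (B).
master.store("A", "data A", primary="slave-1", backup="slave-3")
master.store("B", "data B", primary="slave-2", backup="slave-4")
master.store("C", "data C", primary="slave-3", backup="slave-2")
master.store("D", "data D", primary="slave-4", backup="slave-1")

slaves[3].alive = False  # slave-4 fails
served_by, contents = master.read("D")  # answered from the copy on slave-1
```

The master never stores the data itself, so losing a slave costs nothing as long as the metadata still points to a live copy.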
Hadoop vs SQL

[Figure: Hadoop (Schema on Read) vs. SQL (Schema on Write)]

• SQL databases use a "schema-on-write" architecture.
  • A schema must be defined before data is written.
  • The schema defines the structure and data types.
  • Data must conform to the schema, or the write is rejected.
• Hadoop uses a "schema-on-read" approach.
  • Data can be brought in without a predefined schema.
  • The schema is applied when the data is read, through code execution.
• Hadoop's flexibility proves powerful for handling massive data volumes.
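
The contrast can be sketched with Python's standard library. The first half uses an in-memory SQLite table whose constraints reject a non-conforming row at write time; the second half keeps raw lines as they are and applies the structure only in the reading code. The table name, the sample rows, and the `read_grades` helper are assumptions made for the example, and plain Python lists stand in for files on HDFS.

```python
import json
import sqlite3

# Schema-on-write: the table definition is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE grade (
        student_id INTEGER NOT NULL,
        course_id  TEXT    NOT NULL,
        grade      INTEGER NOT NULL CHECK (typeof(grade) = 'integer')
    )
    """
)
conn.execute("INSERT INTO grade VALUES (1, 'ISE4172', 91)")

try:
    # The grade is not an integer, so the write itself is rejected.
    conn.execute("INSERT INTO grade VALUES (2, 'ISE4172', 'A-')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Schema-on-read: raw lines are stored untouched; structure is imposed
# by code only when the data is read.
raw_lines = [
    '{"student_id": 1, "course_id": "ISE4172", "grade": 91}',
    '{"student_id": 2, "course_id": "ISE4172", "grade": "A-"}',
    "not even JSON",
]


def read_grades(lines):
    """Apply the schema while reading; skip rows that do not fit."""
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(row, dict) and isinstance(row.get("grade"), int):
            yield row["student_id"], row["course_id"], row["grade"]


valid_rows = list(read_grades(raw_lines))
stored_in_sql = conn.execute("SELECT COUNT(*) FROM grade").fetchone()[0]
```

In the first half the bad row never enters the table; in the second half all three lines are stored, and a different reader could later interpret the same lines with a different schema.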
Hadoop vs SQL

[Figure: Hadoop (Compressed Files) vs. SQL (Logical Forms). The SQL side shows four related tables:]
• Student: Student ID, Name, Address, Phone, Email
• Grade: Student ID, Course ID, Grade, Attempt
• Course: Course ID, Title, Instructor, Room No.
• Room: Room No., Capacity, Computers (Y/N), Multimedia (Y/N)

Hadoop vs SQL

[Figure: searching for “Jane Doe” across monthly data (Jan, Feb, Mar, Apr, ...): Hadoop (MapReduce) vs. SQL (Relational Search)]

• SQL databases:
  • Organized in a logical form with interrelated tables and compatible keys.
• Hadoop:
  • Data stored in compressed files within the Hadoop Distributed File System (HDFS).
  • Replicated across multiple machines for fault tolerance.
  • Master node tracks replicated data locations.
  • Power of Hadoop lies in parallel distribution of tasks.
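
The parallel distribution of tasks can be sketched with the “Jane Doe” search from the earlier figure. Each monthly partition is scanned by its own worker and the partial results are merged afterwards. The partition contents and function names are made up for illustration, and threads in one process merely stand in for separate cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical monthly partitions, standing in for files on different nodes.
partitions = {
    "Jan": ["Jane Doe,login", "John Roe,login"],
    "Feb": ["John Roe,purchase"],
    "Mar": ["Jane Doe,purchase", "Jane Doe,logout"],
    "Apr": ["John Roe,logout"],
}


def search_partition(item, name="Jane Doe"):
    """Map task: scan one partition and return its matching records."""
    month, records = item
    return month, [r for r in records if r.split(",")[0] == name]


def parallel_search(parts):
    """Scan every partition with its own worker, then merge the hits."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        results = list(pool.map(search_partition, parts.items()))
    return {month: hits for month, hits in results if hits}


matches = parallel_search(partitions)
```

No worker needs to see another worker's partition, which is what lets the same search spread over as many machines as there are partitions.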
Hadoop vs SQL

[Figure: the same “Jane Doe” search while some monthly data is still pending (⏳): Hadoop (Return whatever is currently available) vs. SQL (Two-phase Commit)]

• SQL: uses "two-phase commit" for consistency.
  • Blocks until the result is complete rather than returning incomplete data to the user.
  • Enforces consistency at both write and access time.
• Hadoop: returns available data immediately.
  • Fills in the missing portions later, so the answer becomes consistent eventually.
• The big data context favors Hadoop:
  • Large volume of potentially dirty data.
  • Data arrives at high velocity.
• Exceptions exist, such as financial transactions, where SQL's complete consistency is crucial.
• In most other big data scenarios, Hadoop is preferred.
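
The two behaviours can be contrasted in a short sketch. `two_phase_commit` applies a write on every participant or on none, while `read_available` mirrors the slide's description of the Hadoop side by returning only the partitions that have arrived. The `Participant` class, the node names, and the sample values are hypothetical, and real databases add logging, timeouts, and recovery that are left out here.

```python
class Participant:
    """One database node taking part in a transaction (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.committed = []
        self.pending = None

    def prepare(self, value):
        # Phase 1: stage the value and vote yes or no.
        self.pending = value if self.can_commit else None
        return self.can_commit

    def commit(self):
        # Phase 2, all voted yes: make the staged value visible.
        self.committed.append(self.pending)
        self.pending = None

    def rollback(self):
        # Phase 2, someone voted no: discard the staged value.
        self.pending = None


def two_phase_commit(participants, value):
    """Commit everywhere or nowhere."""
    votes = [p.prepare(value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False


def read_available(partitions):
    """Return whatever has arrived so far; None marks a pending partition."""
    return [rows for rows in partitions.values() if rows is not None]


nodes = [Participant("db-1"), Participant("db-2", can_commit=False)]
ok = two_phase_commit(nodes, "row-42")  # one node votes no

healthy = [Participant("db-1"), Participant("db-2")]
ok_all = two_phase_commit(healthy, "row-42")  # every node votes yes

partial = read_available({"Jan": ["r1"], "Feb": None, "Mar": ["r2"]})
```

In the first call nothing is committed on either node, which is the all-or-nothing guarantee a financial transaction needs; the partial read trades that guarantee for an immediate answer.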
Further Reading

◉ Wikipedia: Apache Hadoop
https://en.wikipedia.org/wiki/Apache_Hadoop
◉ SAS, What is Hadoop?
https://www.sas.com/en_us/insights/big-data/hadoop.html
◉ Hadoop Documentation - “HDFS Architecture”
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
