Ch6 Architectural Design v1

The document provides an overview of Hadoop, an open-source platform for distributed storage and processing of large datasets. It discusses the Hadoop ecosystem, including components like HDFS, YARN, and MapReduce, as well as the differences between Hadoop and SQL databases. The document emphasizes Hadoop's flexibility and efficiency in handling big data through a schema-on-read approach and parallel task distribution.


Big Data Analytics

14014305-3

Slides are utilized from ISE:4172 Big Data Analytics by Stephen Baek,
University of Iowa, with appreciation for their educational contribution.
Overview of Hadoop
What is Hadoop and why is it useful?
Last Time...

◉ The 3V’s of Big Data: Volume, Velocity, and Variety
◉ The first V (Volume) is especially problematic.
◉ What can we do if we have a really large set of data?
○ Use a distributed storage system that spreads the data across multiple machines/computers!
Local vs. Distributed

◉ Local Machine:
○ Uses only its own computational resources
◉ Distributed System:
○ Utilizes the resources of many machines across a network
◉ Vertical Scaling:
○ Adding resources to a single machine; expensive
◉ Horizontal Scaling:
○ Adding more computers via the network; cost-effective
What is Hadoop

◉ An open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Hadoop History
Source: Doug Cutting Twitter

◉ Google File System (GFS) and MapReduce papers in 2003 and 2004
◉ Nutch, an open source search engine project
◉ Doug Cutting and Mike Cafarella; Hadoop split out of Nutch in 2006, backed by Yahoo!

Hadoop Ecosystem

• Hadoop Distributed File System (HDFS)
  • Distributes large datasets across multiple servers
  • Ensures fault tolerance through data replication
• Yet Another Resource Negotiator (YARN)
  • Manages computing resources within Hadoop clusters
  • Allocates resources efficiently for executing tasks
• MapReduce
  • Computational model for distributed processing in Hadoop
  • Utilizes mappers and reducers to process data in parallel across the cluster
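
The mapper/reducer idea can be illustrated with a minimal single-process sketch in plain Python. This is not Hadoop's API: the function names (`mapper`, `shuffle`, `reducer`, `word_count`) and the sample lines are assumptions made for illustration, and a real cluster would run the map and reduce steps on different machines.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: combine all counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different node; here it is sequential.
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(k, v) for k, v in shuffle(mapped).items())


# Hypothetical input lines, one per "record".
counts = word_count(["big data", "big clusters", "data data"])
```

Because each line is mapped independently and each word is reduced independently, both steps can be spread across many machines without changing the result.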
Hadoop Ecosystem

• Pig and Hive
  • High-level scripting languages for Hadoop and MapReduce
  • Simplify data processing tasks for users not proficient in lower-level languages
• Apache Ambari
  • Administrative interface for overseeing Hadoop clusters
  • Provides management and monitoring functionalities for efficient cluster operation
Hadoop Ecosystem

• Apache Mesos
  • Manages computer clusters similarly to YARN
  • Handles task scheduling and resource management within the cluster
• Apache Spark
  • Fast and widely adopted technology within the ecosystem
  • Offers significant performance improvements over MapReduce
  • Supports multiple programming languages, including Scala, Java, and Python
Motivation: Project Management

◉ Tricia (Project Manager) leads a team of four: John, Sarah, Sanjay, and Bob.
◉ Each member owns one project, and Tricia keeps the assignments as “metadata”:
○ John: A
○ Sarah: B
○ Sanjay: C
○ Bob: D
◉ If one member becomes unavailable (Bob 😖), Project D stalls unless someone else also holds it, e.g. John: A (D).
◉ With a backup assignment (in parentheses) for every project, any single member can drop out without losing work:
○ John: A (D)
○ Sarah: B (C)
○ Sanjay: C (A)
○ Bob: D (B)
Hadoop Master/Slave Architecture

◉ The Master Node keeps the “metadata”: which Slave Node holds which data, with the replica it also stores in parentheses.
○ Slave 1: A (D)
○ Slave 2: B (C)
○ Slave 3: C (A)
○ Slave 4: D (B)
◉ Each Slave Node stores its own data plus a copy of another node's data.
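
The role of the master's metadata can be sketched in a few lines of Python. This is an illustration of the diagram only, not HDFS code: the `metadata` dictionary, the node names, and the `locate` helper are hypothetical, and the primary/backup layout is copied from the slide (Slave 1: A (D), Slave 2: B (C), Slave 3: C (A), Slave 4: D (B)).

```python
# Hypothetical metadata held by the master: data item -> [primary, backup].
metadata = {
    "A": ["slave1", "slave3"],
    "B": ["slave2", "slave4"],
    "C": ["slave3", "slave2"],
    "D": ["slave4", "slave1"],
}


def locate(item, metadata, failed=frozenset()):
    """Return the first live node holding `item`, or None if every copy is lost."""
    for node in metadata[item]:
        if node not in failed:
            return node
    return None


# Normal operation: D is read from its primary holder.
primary = locate("D", metadata)
# Slave 4 has failed: the master redirects the read to the backup copy.
fallback = locate("D", metadata, failed={"slave4"})
# Both holders have failed: the data is unavailable.
lost = locate("D", metadata, failed={"slave4", "slave1"})
```

With one backup per item, any single node can fail without data loss; losing both holders of the same item is what replication cannot cover.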
Hadoop vs SQL

Hadoop (Schema on Read) SQL (Schema on Write)


Hadoop vs SQL

• SQL databases use a "Schema-on-write" architecture.
  • Requires a predefined schema when data is written.
  • The schema defines structure and data types.
  • Data must conform to the schema, or the write is rejected.
• Hadoop utilizes a "Schema-on-read" approach.
  • Allows data to be brought in without a predefined schema.
  • The schema is applied when the data is read, through code execution.
• Hadoop's flexibility proves powerful for handling massive data volumes.
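
The contrast between the two approaches can be sketched with Python's standard library. This is a small illustration under stated assumptions, not Hadoop itself: the `grade` table, the sample records, and the `read_scores` helper are hypothetical, and SQLite stands in for a generic SQL database (a `CHECK` constraint is used to make the type rule strict).

```python
import sqlite3

# Schema-on-write: the table definition is enforced when rows are inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grade ("
    " student_id INTEGER NOT NULL,"
    " score INTEGER NOT NULL CHECK (typeof(score) = 'integer'))"
)
conn.execute("INSERT INTO grade VALUES (?, ?)", (1, 90))
try:
    conn.execute("INSERT INTO grade VALUES (?, ?)", (2, "n/a"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the non-conforming row never enters the table

# Schema-on-read: raw records are stored untouched, whatever their shape.
raw_records = ["1,90", "2,n/a", "3,75,extra-field"]


def read_scores(lines):
    """Impose (student_id, score) structure only at read time."""
    for line in lines:
        fields = line.split(",")
        try:
            yield int(fields[0]), int(fields[1])
        except (IndexError, ValueError):
            continue  # skip records that do not fit this reader's schema


scores = list(read_scores(raw_records))
```

In the second half, nothing is rejected at load time; a different reader could later apply a different schema to the same raw records.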
Hadoop vs SQL

Hadoop (Compressed Files) vs. SQL (Logical Forms)

• Student: Student ID, Name, Address, Phone, Email
• Grade: Student ID, Course ID, Grade, Attempt
• Course: Course ID, Title, Instructor, Room No.
• Room: Room No., Capacity, Computers (Y/N), Multimedia (Y/N)
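
The SQL side of this slide, tables linked through shared keys, can be sketched with SQLite from Python's standard library. Only three of the four tables are shown, with a reduced set of columns; the sample rows and the lowercase table and column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE grade (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    grade      TEXT,
    attempt    INTEGER
);
-- Hypothetical sample rows.
INSERT INTO student VALUES (1, 'Jane Doe');
INSERT INTO course  VALUES (10, 'Big Data Analytics');
INSERT INTO grade   VALUES (1, 10, 'A', 1);
""")

# The Grade table ties Student and Course together through their keys.
rows = conn.execute("""
    SELECT s.name, c.title, g.grade
    FROM grade AS g
    JOIN student AS s ON s.student_id = g.student_id
    JOIN course  AS c ON c.course_id  = g.course_id
""").fetchall()
```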


Hadoop vs SQL

Figure: searching for “Jane Doe” across monthly data (Jan, Feb, Mar, Apr, ...). Hadoop (MapReduce) vs. SQL (Relational Search)


Hadoop vs SQL

• SQL databases:
  • Organized in a logical form with interrelated tables and compatible keys.
• Hadoop:
  • Data stored in compressed files within the Hadoop Distributed File System (HDFS).
  • Replicated across multiple machines for fault tolerance.
  • Master node tracks replicated data locations.
• Power of Hadoop lies in parallel distribution of tasks.
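
The parallel distribution of a search, as in the “Jane Doe” example, can be sketched with a thread pool standing in for cluster nodes. This is an illustration only: the monthly `partitions`, the names and amounts in them, and the `search_partition` helper are hypothetical, and real Hadoop jobs run on separate machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical monthly partitions; in HDFS these would live on different nodes.
partitions = {
    "Jan": ["Jane Doe,120", "John Roe,80"],
    "Feb": ["John Roe,95"],
    "Mar": ["Jane Doe,60", "Jane Doe,15"],
    "Apr": ["Ann Poe,40"],
}


def search_partition(item, name="Jane Doe"):
    """Map step: scan one partition and return the matching amounts."""
    month, rows = item
    hits = [int(r.split(",")[1]) for r in rows if r.split(",")[0] == name]
    return month, hits


# Every partition is scanned independently, so the scans can run in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    per_month = dict(pool.map(search_partition, partitions.items()))

# Reduce step: combine the partial results into one answer.
total = sum(sum(hits) for hits in per_month.values())
```

Adding more partitions adds more independent scans, which is why this style of work scales horizontally.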
Hadoop vs SQL

Figure: searching for “Jane Doe” across monthly data (Jan, Feb, Mar, Apr, ...). Hadoop (Return whatever is currently available) vs. SQL (Two-phase Commit)
Hadoop vs SQL

• SQL: Uses "Two-phase Commit" for consistency.
  • Does not return incomplete data to the user.
  • Enforces consistency at both write and access time.
• Hadoop: Returns available data immediately.
  • Fills in missing portions to provide a consistent answer eventually.
• Context of big data favors Hadoop:
  • Large volume of potentially dirty data.
  • High velocity of data increase.
  • Exceptions exist, such as financial transactions, where SQL's complete consistency is crucial.
• In most big data scenarios, Hadoop is preferred.
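
The difference between blocking until the data is complete and returning what is currently available can be sketched as two query functions. This is a toy model, not an implementation of two-phase commit or of any Hadoop component: the function names, the `data` partitions, and the `available` set are assumptions made for illustration.

```python
def query_all_or_nothing(partitions, available):
    """Blocking style: answer only when every partition can be read."""
    if not all(name in available for name in partitions):
        return None  # the caller must wait and retry
    return sum(sum(rows) for rows in partitions.values())


def query_best_effort(partitions, available):
    """Return what is readable now, plus the partitions still missing."""
    total = sum(
        sum(rows) for name, rows in partitions.items() if name in available
    )
    missing = sorted(name for name in partitions if name not in available)
    return total, missing


# Hypothetical data: the Feb partition is temporarily unreachable.
data = {"Jan": [120], "Feb": [30], "Mar": [75]}
strict_now = query_all_or_nothing(data, available={"Jan", "Mar"})
partial_now = query_best_effort(data, available={"Jan", "Mar"})
# Once Feb comes back, both styles agree on the final answer.
strict_later = query_all_or_nothing(data, available={"Jan", "Feb", "Mar"})
partial_later = query_best_effort(data, available={"Jan", "Feb", "Mar"})
```

The best-effort answer is usable immediately but may be incomplete, which is acceptable for many analytics queries and unacceptable for something like an account balance.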
Further Reading

◉ Wikipedia: Apache Hadoop
https://en.wikipedia.org/wiki/Apache_Hadoop
◉ SAS, What is Hadoop?
https://www.sas.com/en_us/insights/big-data/hadoop.html
◉ Hadoop Documentation - “HDFS Architecture”
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
