Hadoop Architecture
Contents
• Distributed System
• DFS
• Hadoop
• Why is it needed?
• Issues
• Mutate / lease
Operating systems
• Operating system - software that supervises
and controls tasks on a computer. Kinds of individual
OS:
– Batch processing: jobs are collected and placed in a
queue; there is no interaction with a job during processing
– Time-shared: computing resources are provided
to different users, who interact with their programs during
execution
– Real-time (RT) systems: fast response; processing can be interrupted
Distributed Systems
• Consists of a number of computers that are connected and
managed so that they automatically share the job processing
load among the constituent computers.
• A distributed operating system is one that appears to its users as
a traditional uniprocessor system, even though it is actually
composed of multiple processors.
• It gives its users a single-system view and provides a single,
unified service.
• The location of files is transparent to users. It provides a virtual
computing environment.
E.g. the Internet, ATM banking networks, mobile computing
networks, Global Positioning Systems, and Air Traffic Control
A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT
COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE
COHERENT SYSTEM
Network Operating System
• In a network operating system the users are aware
of the existence of multiple computers.
• The operating system of each computer must
provide facilities for communication and shared
functionality.
• Each machine runs its own OS and has its own users.
• Remote login and file access
• Less transparent, but more independence
[Figure: Distributed OS vs. Networked OS]
DFS
• Resource sharing is the motivation behind distributed
systems; to share files, we need a file system.
• File System is responsible for the organization, storage,
retrieval, naming, sharing, and protection of files.
• The file system is responsible for controlling access to
the data and for performing low-level operations such as
buffering frequently used data and issuing disk I/O
requests
• The goal is to allow users of physically distributed
computers to share data and storage resources by
using a common file system.
Hadoop
What is Hadoop?
A framework for running applications on large clusters of
commodity hardware, designed to store and process huge
volumes of data
An Apache Software Foundation project
Open source
Runs on commodity clusters and on Amazon’s EC2
An alpha (0.18) release is available for download
Hadoop Includes
HDFS - a distributed filesystem
Map/Reduce - Hadoop implements this programming model; it
is an offline (batch) computing engine
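
As a concrete sketch of the HDFS side: a client reads and writes files through the org.apache.hadoop.fs.FileSystem API. The path and the text written below are made up for illustration; the cluster address comes from the standard configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from the cluster configuration files
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path("/tmp/hello.txt");   // hypothetical path

    // Write a file; HDFS splits it into blocks and replicates them
    FSDataOutputStream out = fs.create(p);
    out.writeUTF("hello, HDFS");
    out.close();

    // Read it back
    FSDataInputStream in = fs.open(p);
    System.out.println(in.readUTF());
    in.close();
  }
}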
Concept
Moving computation is more efficient than moving large
data
• Data intensive applications with Petabytes of data.
• Web pages: 20+ billion web pages x 20 KB = 400+
terabytes
– One computer can read 30-35 MB/sec from disk:
~four months to read the web
– the same scan with 1000 machines: < 3 hours
(see the arithmetic sketch after this list)
• Difficulty with a large number of machines
– communication and coordination
– recovering from machine failure
– status reporting
– debugging
– optimization
– locality
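
The scan-time estimate in the list above is easy to verify; a minimal sketch, assuming the slide's ~35 MB/s disk-read figure:

public class ScanTime {
  public static void main(String[] args) {
    double bytes = 20e9 * 20e3;          // 20+ billion pages x 20 KB ~= 400 TB
    double rate  = 35e6;                 // ~35 MB/s sustained read per disk
    double oneMachine = bytes / rate;    // seconds on a single machine
    // ~132 days, i.e. about four months
    System.out.printf("1 machine:     %.0f days%n", oneMachine / 86400);
    // ~3 hours when the scan is split across 1000 machines
    System.out.printf("1000 machines: %.1f hours%n", oneMachine / 1000 / 3600);
  }
}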
FACTS
Single-thread performance doesn’t matter:
we have large problems, and total throughput/price matters
more than peak performance
Stuff breaks – more reliability is needed
• If you have one server, it may stay up three years (1,000 days)
• If you have 10,000 servers, expect to lose ten a day
(10,000 servers ÷ 1,000-day lifetime ≈ 10 failures/day)
“Ultra-reliable” hardware doesn’t really help
At large scales, super-fancy reliable hardware still fails, albeit
less often
– software still needs to be fault-tolerant
– commodity machines without fancy hardware give better
perf/price
Types of metadata (maintained by the NameNode):
list of files; file and chunk namespaces; list of
blocks and locations of replicas; file attributes, etc.
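
In code form, these tables can be pictured roughly as follows; the class and field names are illustrative, not Hadoop's actual internals:

import java.util.List;
import java.util.Map;

// Illustrative only: a rough picture of what the NameNode tracks in memory
class NameNodeMetadata {
  // Namespace: full file path -> ordered list of block IDs
  Map<String, List<Long>> fileToBlocks;

  // Block locations: block ID -> DataNodes currently holding a replica
  Map<Long, List<String>> blockToDataNodes;

  // File attributes: path -> owner, permissions, replication factor, ...
  Map<String, FileAttributes> attributes;
}

class FileAttributes {
  String owner;
  short permissions;
  short replication;
  long modificationTime;
}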
DFS SLAVES or DATA NODES
• Serve read/write requests from clients
• Perform replication tasks upon instruction from the
NameNode
Data nodes act as:
1) A Block Server
– Stores data in the local file system
– Stores metadata of a block (e.g. CRC)
– Serves data and metadata to Clients
2) Block Report: Periodically sends a report of all
existing blocks to the NameNode
3) Periodically sends a heartbeat to the NameNode (to
detect node failures)
4) Facilitates pipelining of data to other specified
DataNodes (duties 2 and 3 are sketched below)
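
Duties 2) and 3) amount to a simple periodic loop. A schematic sketch, with invented method names and intervals rather than Hadoop's real RPC interface:

import java.util.List;

// Schematic DataNode service loop: heartbeat often, block-report rarely
class DataNodeLoop {
  interface NameNodeProtocol {              // stand-in for the real RPC interface
    void heartbeat(String dataNodeId);      // "I am alive"
    void blockReport(String dataNodeId, List<Long> blockIds); // "here is what I store"
  }

  void run(NameNodeProtocol nn, String id, List<Long> localBlocks)
      throws InterruptedException {
    long lastReport = 0;
    while (true) {
      nn.heartbeat(id);                     // missed heartbeats => NameNode marks node dead
      long now = System.currentTimeMillis();
      if (now - lastReport > 3600_000) {    // e.g. an hourly block report (illustrative)
        nn.blockReport(id, localBlocks);
        lastReport = now;
      }
      Thread.sleep(3_000);                  // e.g. a heartbeat every few seconds (illustrative)
    }
  }
}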
• Map/Reduce Master “Jobtracker”
– Accepts MR jobs submitted by users
– Assigns Map and Reduce tasks to Tasktrackers
– Monitors task and tasktracker status;
re-executes tasks upon failure
• Map/Reduce Slaves “Tasktrackers”
– Run Map and Reduce tasks upon instruction
from the Jobtracker
– Manage storage and transmission of
intermediate output (the WordCount sketch below
shows both halves in action)
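
To make the division of labour concrete, here is the classic WordCount job written against the org.apache.hadoop.mapred API of the 0.18 era; the Jobtracker schedules instances of the Map and Reduce classes below onto Tasktrackers.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Map task: emit (word, 1) for every word in the input line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reduce task: sum the counts for each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);  // submits the job to the Jobtracker
  }
}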
SECONDARY NAME NODE