10th August Morning and Afternoon session Hadoop (1)
[Figure: parallel I/O comparison - 1 machine vs 10 machines, each with 4 I/O channels at 100 MB/s per channel]
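As a rough back-of-the-envelope reading of the comparison: one machine with 4 I/O channels at 100 MB/s each sustains about 4 x 100 = 400 MB/s, while 10 such machines reading in parallel sustain roughly 4,000 MB/s, so the same dataset can be scanned about ten times faster.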
In 2002, Doug Cutting and Mike Cafarella started the Apache Nutch project, aimed at building a web search engine that could crawl and index websites.
HDFS Architecture
• HDFS is a block-structured file system where each file is divided into blocks of a pre-determined size and stored across a cluster of one or several machines.
• Moving Computation is Cheaper than Moving Data.

Name Node:
• Master daemon - maintains and manages the Data Nodes.
• Records the metadata of all the files stored in the cluster, e.g. location of data, size of files, permissions, etc. (see the metadata sketch below).
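As an illustration of the metadata the NameNode serves, here is a minimal sketch using the Hadoop Java FileSystem API to query a file's status; the NameNode address and file path are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // getFileStatus() is answered from the NameNode's metadata:
        // file size, permissions, replication factor, block size, etc.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // hypothetical path
        System.out.println("Length     : " + status.getLen());
        System.out.println("Permission : " + status.getPermission());
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize());
        fs.close();
    }
}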
• Default replica placement: one replica on the local node, another on a node in a remote rack, a third on a different node in that same remote rack; additional replicas are placed randomly.
• Data placement is exposed so that computation can be migrated to the data (see the block-location sketch below).
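To show how that placement information is exposed to clients and schedulers, the following sketch (again with a placeholder NameNode address and a hypothetical path) asks which hosts hold each block of a file via getFileBlockLocations():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.util.Arrays;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // For each block, the NameNode reports which DataNodes hold a replica;
        // a scheduler can use this to run computation on or near those hosts.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}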
HDFS Read Architecture:
• The client reaches out to the NameNode asking for the block metadata of the file.
• The NameNode returns the list of DataNodes on which each block (Block A and Block B) is stored.
• The client then connects to the DataNodes where the blocks are stored.
• The client reads the data in parallel from the DataNodes (Block A from DataNode 1 and Block B from DataNode 3).
• Once the client has all the required file blocks, it combines them to form the file.
• While serving a client's read request, HDFS selects the replica closest to the client, which reduces read latency and bandwidth consumption (a client-side sketch follows this list).
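The same flow looks like this from the client side with the Hadoop Java API: the open() call performs the metadata lookup against the NameNode, and the returned stream then pulls block data directly from the DataNodes. The NameNode address and file path below are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt"); // hypothetical path
        // open() asks the NameNode for block locations; reading the stream
        // streams the block contents from the DataNodes that hold them.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}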
MUTATION ORDER AND LEASES
DATA CORRECTNESS
• Guarantees:
• Checkpoints for incremental writes
• Checksums for records/chunks to detect corruption
• Unique IDs for records to detect duplicates
• Stale replicas detected by version number (a checksum sketch follows this list)
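As a simplified illustration of per-chunk checksumming (HDFS computes CRC checksums over fixed-size chunks of each block; the 512-byte chunk size and plain CRC32 used here are assumptions for the sketch), the following standalone Java snippet checksums data chunk by chunk and flags a corrupted chunk on re-verification:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChunkChecksumExample {
    static final int CHUNK_SIZE = 512; // assumed fixed chunk size for this sketch

    // Compute one CRC32 value per fixed-size chunk of the data.
    static long[] checksumChunks(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int start = i * CHUNK_SIZE;
            int end = Math.min(start + CHUNK_SIZE, data.length);
            CRC32 crc = new CRC32();
            crc.update(data, start, end - start);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] original = "some record data written to a block".getBytes(StandardCharsets.UTF_8);
        long[] expected = checksumChunks(original);

        // Simulate corruption of one byte, then re-verify chunk by chunk.
        byte[] corrupted = original.clone();
        corrupted[3] ^= 0x01;
        long[] actual = checksumChunks(corrupted);

        for (int i = 0; i < expected.length; i++) {
            if (expected[i] != actual[i]) {
                System.out.println("Chunk " + i + " failed checksum verification");
            }
        }
    }
}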