Unit II Hadoop and Map Reduce Overview
• It all started with two people, Mike Cafarella and Doug Cutting, who were
building a search engine system that could index 1 billion pages.
• They estimated that such a system would cost around half a million
dollars in hardware, with a monthly running cost of $30,000, which is
quite expensive.
• They came across a paper, published in 2003, that described the
architecture of Google’s distributed file system, called GFS.
• Later in 2004, Google published one more paper that introduced
MapReduce to the world.
• Finally, these two papers led to the foundation of the framework
called “Hadoop”.
What is Hadoop?
• Hadoop is an open-source framework from Apache used to store,
process, and analyze data that is very large in volume.
• Hadoop is written in Java and is not an OLAP (online analytical processing) system.
• It is used for batch/offline processing.
• It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many
more.
• Moreover, it can be scaled up simply by adding nodes to the cluster.
By now, you must have an idea of why Big Data is a problem statement and how Hadoop solves it.
• In HDFS, there is no schema validation before data is dumped. HDFS also follows a write-once,
read-many model: you can write any kind of data once and then read it multiple times to find
insights.
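For example (the file name here is hypothetical), a log file is loaded into HDFS once and can then be read any number of times:
$hadoop fs -put events.log /data/events.log
$hadoop fs -cat /data/events.log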
The third challenge was processing the data faster
• We move the processing unit to the data instead of moving the data to the processing unit.
• That is, instead of moving data from different nodes to a single master node for processing, the processing
logic is sent to the nodes where the data is stored, so that each node can process its part of the data in parallel (see the WordCount sketch after this list).
• Finally, all of the intermediate output produced by each node is merged together and the final response is sent back
to the client.
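As a concrete sketch of this model, below is the classic WordCount job, adapted from the standard Apache Hadoop tutorial example and written against the org.apache.hadoop.mapreduce Java API. The map logic is shipped to the nodes holding the input blocks, and the reduce step merges the intermediate counts:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs on the node that stores each input split: emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Merges the intermediate output: sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with something like $hadoop jar wordcount.jar WordCount /input /output (the jar name and HDFS paths are hypothetical), the framework runs one map task per input block on the node storing that block, rather than pulling all of the blocks to a single machine.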
Hadoop Architecture
• The NameNode keeps the HDFS metadata in two files: the fsimage, a snapshot of the file system namespace, and the edit log (edits), which records every change made since the last fsimage was written.
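As a rough illustration (the directory path and transaction IDs below are hypothetical; the actual location is whatever dfs.namenode.name.dir points to), listing the NameNode's metadata directory shows both files:
$ls /data/hadoop/dfs/name/current
VERSION
seen_txid
fsimage_0000000000000000042
fsimage_0000000000000000042.md5
edits_inprogress_0000000000000000043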
• copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS
store. This is the most important command; “local file system” here means the files
present on the OS.
$hadoop fs -put <local-file> <hdfs-path>
$hadoop fs -copyFromLocal <local-file> <hdfs-path>
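For example (the file and directory names are hypothetical):
$hadoop fs -put sales.csv /user/hadoop/data
$hadoop fs -copyFromLocal sales.csv /user/hadoop/data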
• copyToLocal (or) get: To copy files/folders from the HDFS store to the
local file system.
$hadoop fs -get <hdfs-path> <local-destination>
$hadoop fs -copyToLocal <hdfs-path> <local-destination>
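For example (again with hypothetical names), to bring a file back into the current local directory:
$hadoop fs -get /user/hadoop/data/sales.csv .
$hadoop fs -copyToLocal /user/hadoop/data/sales.csv .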