BigData Unit 2
In the early days, big data required a lot of raw computing power, storage, and
parallelism, which meant that organizations had to spend a lot of money to build
the infrastructure needed to support big data analytics. Given the large price tag,
only the largest Fortune 500 organizations could afford such an infrastructure.
The Birth of MapReduce: The only way to get around this problem was to break big data down into manageable chunks and run smaller jobs in parallel on low-cost hardware, with fault tolerance and self-healing managed in the software. This was the primary goal of the Hadoop Distributed File System (HDFS). And to fully capitalize on big data, MapReduce came on the scene. This programming paradigm made massive scalability possible across hundreds or thousands of servers in a Hadoop cluster.
YARN Comes on the Scene: The first generation of Hadoop provided affordable
scalability and a flexible data structure, but it was really only the first step in the
journey. Its batch-oriented job processing and consolidated resource management
were limitations that drove the development of Yet Another Resource Negotiator
(YARN). YARN essentially became the architectural center of Hadoop, since it
allowed multiple data processing engines to handle data stored in one platform.
This new, modern data architecture made it possible for Hadoop to become a true
data operating system and platform. YARN separated the data persistence
functions from the different execution models to unify data for multiple workloads.
Hadoop Version 2 provides the foundation for today’s data lake strategy, which
is basically a large object-based storage repository that holds data in its native
format until it is needed. However, using the data lake only as a consolidated data
repository is shortsighted; Hadoop is really meant to be used as an interactive, multi-workload, operational data platform.
Data Storage:
HBase: Apache HBase is a NoSQL database built for hosting large tables with billions of rows and millions of columns on top of Hadoop, running on commodity hardware. Use Apache HBase when you need random, real-time read/write access to your Big Data.
Features:
• Strictly consistent reads and writes; in-memory operations.
• Easy-to-use Java API for client access.
• Well integrated with Pig, Hive and Sqoop.
• A consistent and partition-tolerant (CP) system in terms of the CAP theorem.
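To illustrate random, real-time read/write access through the Java client API, here is a minimal sketch; the table name test_table, column family cf and column col1 are made-up examples, and the cluster address is picked up from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the cluster address
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test_table"))) {

            // Random write: put a single cell into row "row1"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("hello"));
            table.put(put);

            // Random read: fetch the same row back immediately
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            System.out.println(Bytes.toString(value));
        }
    }
}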
Cassandra: Cassandra is a NoSQL database designed for linear scalability and high availability. Cassandra is based on a key-value model. It was originally developed at Facebook and is known for fast responses to queries.
Features:
• Column indexes
• Support for de-normalization
• Materialized views
• Powerful built-in caching.
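As a rough sketch of Cassandra's key-value access pattern, the example below assumes the DataStax Java driver (3.x API); the contact point, the demo keyspace and the users table are made up for illustration.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to a local Cassandra node (contact point is an assumption for this sketch)
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect();

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                    + "(user_id text PRIMARY KEY, name text)");

            // Write and read by primary key -- the key-value style access Cassandra is built around
            session.execute("INSERT INTO demo.users (user_id, name) VALUES ('u1', 'Alice')");
            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE user_id = 'u1'");
            Row row = rs.one();
            System.out.println(row.getString("name"));
        }   // closing the Cluster also closes its Sessions
    }
}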
Data Serialization:
Avro: Apache Avro is a language-neutral data serialization framework. It is designed for language portability, allowing data to potentially outlive the language used to read and write it.
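A minimal sketch of Avro serialization in Java is shown below; the User schema and the users.avro file name are made-up examples. Because the JSON schema is embedded in the data file, any language with an Avro library can read the records back.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // The schema is defined in JSON, independent of any programming language
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Serialize a record; the schema travels with the file
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize using the schema stored with the data
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " " + rec.get("age"));
            }
        }
    }
}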
Thrift: Thrift is an interface definition language and framework used to build interfaces that interact with technologies built on Hadoop. It is used to define and create services for numerous languages.
Data Intelligence:
Drill: Apache Drill is a low-latency SQL query engine for Hadoop and NoSQL.
Features:
• Agility
• Flexibility
• Familiarity
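Because Drill exposes a standard JDBC interface, a SQL query can be issued from Java roughly as follows; the connection URL assumes a Drillbit running on localhost, and the query targets the sample employee.json dataset that ships on Drill's classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQuery {
    public static void main(String[] args) throws Exception {
        // Connect directly to a Drillbit on localhost (deployment-specific assumption;
        // a ZooKeeper-based URL such as jdbc:drill:zk=host:2181 is also common)
        try (Connection conn = DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // Query the sample JSON dataset bundled on Drill's classpath
             ResultSet rs = stmt.executeQuery(
                     "SELECT first_name, last_name FROM cp.`employee.json` LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString("first_name") + " " + rs.getString("last_name"));
            }
        }
    }
}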
Mahout: Apache Mahout is a scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has implementations on Apache Spark for faster in-memory computing.
Features:
• Collaborative filtering.
• Classification
• Clustering
• Dimensionality reduction
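For example, collaborative filtering with Mahout's classic Taste recommender API looks roughly like the sketch below; the ratings.csv file of user,item,rating triples and the neighborhood size of 10 are assumptions for illustration (newer Mahout releases favor the Spark-based Samsara environment instead).

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommender {
    public static void main(String[] args) throws Exception {
        // ratings.csv holds lines of the form userID,itemID,rating (file name is an assumption)
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}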
Data Integration:
Apache Sqoop: Apache Sqoop is a tool designed for bulk data transfers between
relational databases and Hadoop.
Features:
• Import and export to and from HDFS.
• Import and export to and from Hive.
• Import into HBase.
Apache Flume: Flume is a distributed, reliable, and available service for efficiently
collecting, aggregating, and moving large amounts of log data.
Features:
• Robust
• Fault tolerant
• Simple and flexible Architecture based on streaming data flows.
Apache Chukwa: A scalable log collector used for monitoring large distributed file systems.
Features:
• Scales to thousands of nodes.
• Reliable delivery.
• Able to store data indefinitely.
NameNode: The NameNode daemon is memory and I/O intensive, so the server hosting it typically doesn't store any user data or perform any computations for a MapReduce program, in order to lower the workload on that machine.
DataNode: Each slave machine in your cluster will host a DataNode daemon to
perform the grunt work of the distributed filesystem - reading and writing HDFS
blocks to actual files on the local file system.
When you want to read or write an HDFS file, the file is broken into blocks, and the NameNode tells your client which DataNode each block resides in. Your client
communicates directly with the DataNode daemons to process the local files
corresponding to the blocks.
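A minimal sketch of this client interaction using the HDFS Java FileSystem API is shown below; the NameNode address comes from core-site.xml on the classpath, and the path /user/demo/sample.txt is just an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Configuration picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt");   // example path

        // Write: the client asks the NameNode for target DataNodes, then streams blocks to them
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello HDFS");
        }

        // Read: the NameNode returns block locations; data is streamed from the DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}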
Another topic that falls under the Secondary NameNode (SNN) is the fsimage (filesystem image) file and the edits file: the NameNode logs filesystem changes to the edits file, and the SNN periodically merges them into the fsimage to keep the checkpoint compact.
JobTracker: There is only one JobTracker daemon per Hadoop cluster. It typically runs on a server as the master node of the cluster.
TaskTracker: As with the storage daemons, the computing daemons also follow a
master/slave architecture: the JobTracker is the master overseeing the overall
execution of a MapReduce job, and the TaskTrackers manage the execution of individual tasks on each slave node.
Each TaskTracker is responsible for executing the individual tasks that the
JobTracker assigns. Although there is a single TaskTracker per slave node, each
TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in
parallel.
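To make this division of labour concrete, here is a condensed word-count job written against the standard Hadoop MapReduce Java API; the JobTracker schedules its map and reduce tasks onto the TaskTrackers, and the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map task: run in parallel on the slave nodes, one task per input split
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce task: sums the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}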
Hadoop stores petabytes of data using the HDFS technology. Using HDFS it is possible to connect commodity hardware or personal computers, also known as nodes in Hadoop parlance. These nodes are connected over a cluster on which the data files are stored in a distributed manner. Using the power of HDFS, the whole cluster and the nodes can be easily accessed for data storage and processing. Access to the data is strictly in a streaming manner using the MapReduce process.
Key features of HDFS:
• HDFS is highly resilient: upon failure, the workload is immediately transferred to another node.
• It provides extremely good throughput even for gigantic volumes of data.
• It is unlike other distributed file systems, since it is based on a write-once-read-many model.
• It allows high data coherence, removes concurrency control issues and speeds up data access.
• HDFS moves computation to the place where the data exists instead of the other way around.
Thus, applications are moved closer to the point where the data resides, which is much cheaper and faster and improves the overall throughput.
The reasons why HDFS works so well with Big Data:
• HDFS uses MapReduce for data access, which is very fast
• It follows a data coherency model that is simple yet highly robust and scalable
• Compatible with any commodity hardware and operating system
• Achieves economy by distributing data and processing on clusters with parallel
nodes
• Data is always safe, as it is automatically replicated to multiple locations
• It provides a Java API and even a C language wrapper on top
• It is easily accessible using a web browser, making it highly utilitarian.