Hadoop
Big Data
What is Big Data?
Big data is a term for data sets that are so large or complex
that traditional data processing applications are inadequate
to deal with them.
Sources of Big Data
Transport Data
Big Data isn't just numbers, dates, and strings. Big Data is
also geospatial data, 3D data, audio and video, and
unstructured text, including log files and social media.
Working with Big Data poses challenges in:
Storage
Searching
Sharing
Transfer
Analysis
Hadoop
History of Hadoop
(Architecture diagram) Hadoop comprises:
MapReduce (distributed computation)
HDFS (distributed storage)
YARN framework
Common (shared utilities and libraries)
HADOOP COMMON:
Common refers to the collection of common utilities and
libraries that support the other Hadoop modules.
These libraries provide file-system and OS-level abstractions
and contain the Java files and scripts required to start Hadoop.
HADOOP YARN:
Yet Another Resource Negotiator
a resource-management platform responsible for managing
computing resources in clusters and scheduling users'
applications on them
HDFS
File Blocks
64 MB (default), 128 MB (recommended) – compare to 4 KB in UNIX
Behind the scenes, one HDFS block is backed by multiple
operating system (OS) blocks
Large blocks fit well with replication, which provides fault
tolerance and availability
(Diagram: one 128 MB HDFS block maps onto many small OS blocks.)
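As a rough sketch of how the block size shows up in client code (assuming a reachable HDFS cluster; the property value, path, and file contents below are illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ask for a 128 MB block size for files created by this client.
        // dfs.blocksize is the standard HDFS property; the cluster-wide
        // default is normally set in hdfs-site.xml.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        // Illustrative path: any HDFS location writable by the current user.
        FSDataOutputStream out = fs.create(new Path("/tmp/blocksize-demo.txt"));
        out.writeUTF("one logical HDFS block is backed by many small OS blocks");
        out.close();
        System.out.println("Default block size: "
                + fs.getDefaultBlockSize(new Path("/")) + " bytes");
    }
}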
Advantages of blocks
Hadoop MapReduce
COMPONENTS OF HADOOP
• HDFS
• MapReduce
• YARN Framework
• Libraries
A DEFINITION
(Diagram: one master node coordinating multiple slave nodes.)
Init - Hadoop divides the input file stored on HDFS into splits (typically of the size of
an HDFS block) and assigns every split to a different mapper, trying to assign every
split to the mapper where the split physically resides.
Mapper - Hadoop reads the mapper's split line by line and calls the map() method of
the mapper for every line, passing it as the key/value parameters; the mapper
computes its application logic and emits other key/value pairs.
Shuffle and sort - Hadoop's partitioner divides the emitted output of the mappers
into partitions, each of which is sent to a different reducer. Hadoop collects all the
partitions received from the mappers and sorts them by key.
Reducer - Hadoop reads the aggregated partitions line by line and calls the reduce()
method of the reducer for every key; the reducer computes its application logic and
emits other key/value pairs, which Hadoop writes to HDFS.
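The partitioning in the shuffle step is, by default, just a hash of the key taken modulo the number of reducers, so all pairs with the same key reach the same reducer. A minimal sketch of that logic, mirroring Hadoop's HashPartitioner (the keys and reducer count here are made up for illustration):

public class HashPartitionSketch {
    // Same arithmetic as org.apache.hadoop.mapreduce.lib.partition.HashPartitioner:
    // mask the sign bit so the hash is non-negative, then take the remainder
    // to pick one of numReducers partitions.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = {"hadoop", "hdfs", "yarn", "hadoop"};
        for (String k : keys) {
            // "hadoop" appears twice and lands on the same reducer both times.
            System.out.println(k + " -> reducer " + partitionFor(k, 3));
        }
    }
}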
COMMON JOBS FOR MapReduce
Index building
Text mining
Graphs
Risk analysis
WORD COUNT USING MapReduce
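A sketch of the classic word-count job, along the lines of the standard Hadoop tutorial version (the input and output paths come from the command line; any readable input directory and not-yet-existing output directory on HDFS will do):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: called once per input line; emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) after shuffle and sort;
    // emits (word, total).
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run with, e.g., hadoop jar wordcount.jar WordCount /input /output to get one (word, count) pair per distinct word in the input.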
BENEFITS
Simplicity
Scalability
Speed
Recovery
Minimal data motion
JAQL
INTRODUCTION
Access and load data from different sources (local file system,
web, Twitter, HDFS, HBase, etc.)
Write data into different places (local file system, HDFS, HBase,
databases, etc.)
Jaql environment
Flexibility
Scalability
Physical Transparency
Modularity
JAQL I/O
To start the Jaql shell:
cd $BIGINSIGHTS_HOME/jaql/bin
./jaqlshell
ADVANTAGES
Simplicity
facilitates development
makes it easier to distribute the processing between nodes
MapReduce jobs can be called directly
HIVE
Apache PIG
What is Pig?
To start the Pig shell:
cd $PIG_HOME/bin
./pig
Twitter Data
Real Twitter data purchased by IBM
Available in JSON (JavaScript Object Notation) format
Objective
Use Jaql core operators to manipulate JSON data found in Twitter feeds:
Filter arrays to remove values
Sort arrays in either ascending or descending sequence
Write data to HDFS
Procedure