
Unit 1

INTRODUCTION TO HADOOP
Evolution of Big Data
WHY ARE MOST ORGANIZATIONS NOWADAYS HEADING TOWARDS BIG DATA?
Why Big Data?
Example
What is Big Data?
• Big data is a term that refers to extremely large, fast-growing, and complex datasets that cannot be stored or processed in traditional data management systems.
• However, when suitably evaluated using modern tools, these massive volumes of data provide organizations with useful insights that help them improve their business by making informed decisions.
Types of Data
• As the internet continues to grow, an incomprehensible amount of data is generated every second. The data floating around the internet is estimated to reach 163 zettabytes by 2025, in the form of tweets, messages, images, emails, e-books, etc. This data can be classified into the following types:

Structured Data
Unstructured Data
Semi-Structured Data
Structured data

• Structured data has certain predefined organizational properties and is present in a structured or tabular schema, making it easier to analyze and sort.
• Each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely valuable, since it allows data to be collected quickly from various locations in the database.
Unstructured data

• Unstructured data entails information with no predefined conceptual definitions; it is not easily interpreted or analyzed by standard databases or data models.
• Unstructured data accounts for the majority of big data and comprises information such as dates, numbers, and facts. Big data examples of this type include video and audio files, mobile activity, satellite imagery, and NoSQL databases.
Semi-structured data

• Semi-structured data is a hybrid of structured and unstructured data. It inherits a few characteristics of structured data but nonetheless contains information that lacks a definite structure and does not conform to relational databases or formal data models.
• For instance, JSON and XML are typical examples of semi-structured data, as in the sketch below.
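To make this concrete, here is a minimal Java sketch of reading one semi-structured JSON record. It assumes the Jackson library is on the classpath, and the record and its field names are hypothetical:

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class SemiStructuredDemo {
        public static void main(String[] args) throws Exception {
            // A hypothetical JSON record: the named fields give it partial
            // structure, but the variable-length array and the nested object
            // would not fit a fixed relational schema without extra modeling.
            String json = "{\"user\": \"alice\", \"tags\": [\"hadoop\", \"bigdata\"],"
                        + " \"location\": {\"city\": \"Pune\"}}";
            JsonNode record = new ObjectMapper().readTree(json);
            System.out.println(record.get("user").asText());                  // alice
            System.out.println(record.get("location").get("city").asText()); // Pune
        }
    }

Fields can still be addressed by name, as in structured data, yet two records in the same file may carry different fields, which is what keeps the format only "semi" structured.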
Memory Unit
Characteristics of Big Data
• Big data can be characterized by five Vs: volume, variety, velocity, value, and veracity.
Volume
• Volume refers to the size of data generated and stored
in a Big Data system.
• We’re talking about data in the petabyte and exabyte range. These massive amounts of data necessitate advanced processing technology, far more powerful than a typical laptop or desktop CPU.
• Instagram or Twitter are examples of platforms that generate datasets of massive volume.
Variety
• Variety entails the types of data, which vary in format and in how they are organized and readied for processing.
• Big names such as Facebook, Twitter, Pinterest, Google Ads, and CRM systems produce data that can be collected, stored, and subsequently analyzed.
Velocity
• The rate at which data accumulates also influences whether the data is classified as big data or regular data.
• Much of this data must be evaluated in real time; therefore, systems must be able to handle the pace and amount of data created.
• The speed at which data is generated means more data is always arriving than before, but it also implies that the velocity of data processing must keep up with the rate of data arrival.
Value
• It is not only the amount of data that we keep or process that is important; it is also that the data is valuable and reliable, and that it is saved, processed, and evaluated to extract insights.
Veracity
• Veracity refers to the trustworthiness and quality of the data. If the data is not trustworthy and/or reliable, then the value of big data becomes questionable.
• This is especially true when working with data that is updated in real time. Therefore, data authenticity requires checks and balances at every level of big data collection and processing.
The Rise of Big Data
Google Big Data Case Study
Big Data Case Study
Google File System
GFS
Challenges of Big Data
• The following are challenges of big data when it is handled with a traditional RDBMS:
Storage
Processing
Security
Storage
• With vast amounts of data generated daily, the greatest challenge is
storage (especially when the data is in different formats) within legacy
systems.
• Unstructured data cannot be stored in traditional databases.
Processing
• Processing big data refers to the reading, transforming, extraction, and formatting of useful information from raw data.
• The input and output of information in unified formats continue to present difficulties.
Security
• Security is a big concern for organizations. Non-encrypted information is at risk of theft or damage by cyber-criminals.
• Therefore, data security professionals must balance access to data against maintaining strict security protocols.
Challenges and Solutions for Big Data
INTRODUCTION TO HADOOP
Introduction to Hadoop
• Apache Hadoop is an open-source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes.
• Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.
Hadoop
Components of Hadoop
HDFS
HDFS
• HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing large sets of structured or unstructured data across various nodes, and it maintains the metadata about that data in the form of log files.
• HDFS consists of two core components:
• NameNode
• DataNode
Components of HDFS
NameNode
• The NameNode is the prime node; it contains the metadata (data about data) and requires comparatively fewer resources than the DataNodes, which store the actual data.
• These DataNodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective. A short client-side sketch of this split follows below.
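As a minimal illustration, the following Java sketch writes a file through the HDFS client API: the client consults the NameNode for metadata, while the bytes themselves are streamed to DataNodes. The NameNode address and the file path are assumptions for this sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS points at the NameNode; hdfs://namenode:9000 is a
            // placeholder address for this sketch.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // The NameNode records the file's metadata; the data blocks are
            // streamed to and stored on DataNodes.
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello HDFS");
            }
            System.out.println("File exists: " + fs.exists(file));
            fs.close();
        }
    }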
DataNode
HDFS Cluster
Example
Features of HDFS
Hadoop MapReduce
• By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps developers write applications that transform big datasets into manageable ones.
• MapReduce makes use of two functions (see the WordCount sketch below):
Map()
Reduce()
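The classic illustration of these two functions is WordCount; the following is a compact Java version, close to the canonical example in the Apache Hadoop documentation, with input and output paths supplied on the command line:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map(): for every word in an input split, emit the pair (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce(): sum all the 1s collected for each distinct word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Map() runs in parallel over input splits and emits (word, 1) pairs; the framework then groups pairs by key, and Reduce() sums each group. That division of labor is what lets the job scale across the nodes of the cluster.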
MapReduce
How MapReduce Works?
Example
Components of Hadoop 2.0
YARN
YARN Working
HADOOP USE CASE
Hadoop Use Case
• This use case illustrates how Hadoop was used to combat fraudulent activities at a particular bank.
Bank Challenge
Approach Used by Bank
How Hadoop Solved The Problem
Hadoop Ecosystem
HDFS
YARN
MapReduce
Sqoop
Flume
Pig
Hive
Spark
Mahout
Ambari
Kafka
Storm
Ranger
Knox
Oozie
