Unit 1: Big Data

Overview of Big Data: Part-1

Big Data Analytics

UNIT I
● What is Data Analytics? Data analytics is the science of examining raw data to draw conclusions from that information. Data analytics involves applying an algorithmic or mechanical process to derive insights, for example running through several data sets to look for meaningful correlations between them. It is used in several industries to allow organizations and companies to make better decisions, as well as to verify or disprove existing theories or models.

● What does Big Data Analytics mean? Big data analytics refers to the strategy of analyzing large volumes of data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos, digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns and connections that might otherwise be invisible, and that might provide valuable insights about the users who created it. Through these insights, businesses may be able to gain an edge over their rivals and make superior business decisions.
Data Analytics vs Data Analysis
● Data Analytics is an advanced and broader form of data analysis; it includes data analysis as a sub-component. It sets the logical framework based on which analysis is done. There are many analytics tools in the market, mainly Python, Apache Spark, etc.
● Data analysis consists of defining a data investigation, then cleaning and transforming the data to give a meaningful outcome. Tools for analysing data include Tableau, Excel, etc.

Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data.
For example: internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the internet of things (IoT).

The importance of big data analytics:
Driven by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers various business benefits, including:
● New revenue opportunities
● More effective marketing
● Better customer service
● Improved operational efficiency
● Competitive advantages over rivals

Structuring Big Data

Three different data structures
For the analysis of data, it is important to understand that there are three common types of data structures: structured, semi-structured, and unstructured.

Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyse. Structured data conforms to a tabular format with relationships between the different rows and columns. Common examples of structured data are Excel files or SQL databases; each of these has structured rows and columns that can be sorted.

Structured data depends on the existence of a data model, a model of how data can be stored, processed and accessed. Because of a data model, each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.

Structured data is considered the most 'traditional' form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.
Semi-structured Data
Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. It is therefore also known as self-describing structure. JSON and XML are common forms of semi-structured data.

Unstructured Data
Unstructured data is information that either does not have a predefined data model or is not organised in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, compared to data stored in structured databases. Common examples of unstructured data include audio and video files, or NoSQL databases.
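To make the distinction concrete, here is a minimal Python sketch (the record and field names are invented for illustration): a JSON document is semi-structured because its tags describe the data, and flattening it into a fixed set of columns turns it into structured, tabular data.

```python
import json

# A semi-structured record: tags (keys) describe the data and fields can nest,
# but there is no fixed relational schema.
raw = '{"id": 1, "name": "Asha", "orders": [{"item": "laptop", "qty": 1}]}'

record = json.loads(raw)

# Flattening into a fixed set of columns yields structured, tabular data.
rows = [(record["id"], record["name"], o["item"], o["qty"])
        for o in record["orders"]]
print(rows)  # [(1, 'Asha', 'laptop', 1)]
```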

Big data analytics technologies and tools:
● Unstructured and semi-structured data types typically don't fit well in traditional data warehouses, which are based on relational databases oriented to structured data sets.
● Further, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually, as in the case of real-time data on stock trading, the online activities of website visitors, or the performance of mobile applications.
● As a result, many of the organizations that collect, process and analyze big data turn to NoSQL databases, as well as Hadoop and its companion data analytics tools, including:
● YARN: a cluster management technology and one of the key features in second-generation Hadoop.
● MapReduce: a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
● Spark: an open source, parallel processing framework that enables users to run large-scale data analytics applications across clustered systems.
● Hive: an open source data warehouse system for querying and analyzing large data sets stored in Hadoop files.
● Kafka: a distributed publish/subscribe messaging system designed to replace traditional message brokers.
● Pig: an open source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs executed on Hadoop clusters.
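As a rough illustration of the kind of job these tools run, here is a minimal PySpark sketch that counts page hits in clickstream logs spread across a cluster. The input path and field layout are hypothetical, and a working Spark installation is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClickstreamHits").getOrCreate()

# Hypothetical HDFS path; each log line is assumed to start with a page URL.
lines = spark.sparkContext.textFile("hdfs:///logs/clickstream/*.log")

hits = (lines.filter(lambda line: line.strip())
             .map(lambda line: line.split()[0])   # take the page URL field
             .map(lambda page: (page, 1))
             .reduceByKey(lambda a, b: a + b))    # runs in parallel on the cluster

for page, count in hits.take(10):
    print(page, count)

spark.stop()
```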
Exploring the use of Big Data in Business Context: Part-2
● Almost all organisations collect relevant data (either directly or through an agency).
● This data relates to customer feedback, information about supplies and retail, current market trends, etc.
● The continuously increasing cost of collecting this information will be just a waste of resources unless some logical conclusion and business insight can be derived from it. This is where big data analytics comes into the picture.
● This will help organisations to reduce cycle time, fulfil orders quickly, and improve forecast accuracy.

AGENDA
● We are going to discuss 4 different areas of big data applications:
○ Use of Big Data in Social Networking
○ Use of Big Data in Preventing Fraudulent Activities
○ Use of Big Data in Detecting Fraudulent Activities in the Insurance Sector
○ Use of Big Data in the Retail Sector
● In each area we will discuss the following aspects:
○ What is the data involved?
○ How to make optimum use of the data?
○ What are the useful insights from analytics of the data?

Use of Big Data in Social Networking

A. What is social network data?
● It refers to data generated from people socializing on social media websites such as Twitter, Facebook, etc.: comments, statuses, posts, likes, etc.
● On a social media website you will find different people constantly adding and updating comments, statuses, preferences, etc.
● The following URL shows the social network data generated per second through various social media: www.internetlivestats.com

B. How to make optimum use of the social networking Big Data?
● https://youtu.be/JAO_3EvD3DY
● Analysing and mining the large volumes of data in social networking sites, such as comments, statuses, posts and likes, shows business trends in general with respect to the "wants" and "preferences" of a wide audience.
● If this data can be systematically segregated on the basis of different age groups, locations, gender, etc., then organisations can design products and services specific to people's needs.
● This is called social network analytics.
● Social network analytics even has advanced applications, such as predicting the online reputation of a brand (ex: TripAdvisor) and increasing profitability in business by targeting influential customers (ex: Instagram influencers).
● In fact, the data generated from social network analytics enables an organisation to calculate the total revenue a customer can influence, instead of only the direct revenue he himself generates (ex: food bloggers).

C. What are the useful insights from Big Data in social networking?
The following are the areas in which the decision-making processes of an organisation are influenced by social networking data:

Business Intelligence: a data analysis process that converts a raw data set into meaningful information that can add value to decision making. Social networking data and its appropriate analysis have proven to be a good aid in providing business intelligence. This can be understood from the following examples, drawn from different business sectors:

I. Customer relationship management data: with the help of social network analytics, organisations can identify customers in their networks who make a large number of calls and text messages and have a large network of friends. Such a customer is said to be highly influential, as studies have shown that when a user of a telephone network leaves, his friends also leave. In fact, some organisations reward their influencer customers with discounts and offers, and these customers in turn spread a positive brand image. Examples from other sectors: Google Pay, Airtel, etc.

II. Link Analysis: social network analytics can also help in law enforcement and anti-terrorism efforts, as it makes it possible to identify trouble groups or people who are directly or indirectly connected to each other. This type of analysis is called LINK ANALYSIS.

III. Marketing: today's consumer preferences have changed due to busy schedules, so marketers aim to deliver what consumers desire by using interactive communication channels such as email, mobile and the web. Sentiment analysis refers to a computer programming technique to analyse human emotions, attitudes and views across popular social networks, including Facebook, Twitter and blogs (a toy sketch follows at the end of this list). The technique requires analytics skills as well as advanced computing applications. Business research organisations and marketing professionals across the globe use sentiment analysis in one form or the other to identify and measure customer behaviour and online trends.
Example: Walmart started a social media analytics company called Kosmix and established a branch, Walmart Labs. It analyses media communications such as blogs, tweets and transaction data to predict trends and learn what customers want.

IV. Product Design and Development: by listening to what consumers want, understanding where the gaps in the product offering are, and so on, an organisation can make the right decisions in the direction of its product design and development.
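As a rough illustration of the sentiment-analysis idea mentioned above, here is a toy lexicon-based scorer in Python. Real systems use far richer models; the word lists here are invented purely for illustration.

```python
# Toy lexicon-based sentiment scorer: counts positive vs. negative words.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "angry"}

def sentiment(text: str) -> str:
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = ["Love the new store layout", "Terrible delivery, very angry"]
for p in posts:
    print(p, "->", sentiment(p))
```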

Preventing Fraudulent Activities

What are fraudulent activities?
● Fraud can be committed by both words and behaviour intended to deceive the other party, generally to gain an advantage over that party. Here, financial frauds are discussed.
● Frauds that occur frequently in financial institutions such as banks and insurance companies, and that involve any type of monetary transaction, are called financial frauds.
● In such frauds, online retailers such as Amazon, eBay and Groupon suffer huge losses, and this is where big data analytics comes to use.

Types of financial frauds
1. Credit card fraud: a very common type of fraud, related to the use of credit card facilities.
● It commonly occurs when a fake or stolen card is used in an online transaction in spite of security checks about the valid owner of the card, such as address verification or the card verification value (CVV) number; fraudsters manage to manipulate the loopholes in the system.

2. Exchange or return policy fraud: occurs when people take advantage of the exchange and return policies offered by an online retailer.
● Example: customers returning a product after using it, or reporting non-delivery and later attempting to sell the product online.
● According to consumer goods regulations, once fraud is proved the retailer has to refund the amount to the customer.
● The online retailer can prevent such fraud by charging a restocking fee on returned goods, getting customer signatures on delivery, and tracing customers known to commit such frauds using their transaction patterns. This is where big data analytics comes to use.
● For example, the retailer can study customers' ordering patterns, frequency of changes in shipping address, rush orders, sudden huge orders, etc.

3. Personal information fraud: this type of fraud occurs when fraudsters obtain the login credentials of customers, purchase a product using them, and change the existing delivery address. When the original customer realises this, he keeps calling the retailer to refund the amount, as he or she has not made the transaction.
Example: a secure OTP acting as a second round of checking after the CVV; Google Pay introducing an opening security PIN apart from the regular PIN for transactions.

How to make optimum use of customer data to prevent fraud
In order to deal with the huge amount of data and gain meaningful insights to avoid fraud, organisations need:
● to derive analytics tools that differentiate between real or genuine and fraudulent customer entries;
● to upgrade their knowledge about emerging methods of fraud and design the necessary prevention checks.

What are the useful insights from big data analytics in real-time fraud detection?
● Live data matching: organisations can compare live details of customers obtained from different sources to validate their authenticity.
● Ex: in an online transaction, big data could compare the incoming IP address with the geo-data received from the customer's smartphone apps. A valid match between the two confirms the authenticity of the transaction (a sketch of this check follows below).
● Ex: costly products can have sensors attached to them that transmit their location information; when such products are delivered to customers, the streaming data obtained from the sensors provides a good source of information for tracing any fraud.
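A simplified Python sketch of the live-data-matching check described above: compare the location derived from the transaction's IP address with the phone's GPS location, and flag a mismatch. The lookup tables, identifiers and threshold are invented for illustration.

```python
import math

# Hypothetical geo sources: IP-derived location vs. the customer's app GPS.
ip_geo  = {"203.0.113.7": (17.38, 78.48)}   # IP address -> (lat, lon)
app_geo = {"customer42": (17.39, 78.47)}    # customer id -> (lat, lon)

def distance_km(a, b):
    # Rough planar approximation; good enough for a sanity check.
    return math.hypot(a[0] - b[0], a[1] - b[1]) * 111

def looks_genuine(ip, customer, threshold_km=50):
    if ip not in ip_geo or customer not in app_geo:
        return False  # missing data: escalate for manual review
    return distance_km(ip_geo[ip], app_geo[customer]) <= threshold_km

print(looks_genuine("203.0.113.7", "customer42"))  # True: locations agree
```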

Image Analytics
This is another emerging field that can help detect frauds. Image analysis (also known as "computer vision" or image recognition) is the ability of computers to recognize attributes within an image. Some examples include facial recognition (smartphones) and position/movement analysis (Google Maps).
Analytical systems that deal with big data are designed to integrate and understand images, videos, text, numbers and all forms of unstructured data in order to facilitate image analytics.
MPP (Massively Parallel Processing database)
● This technology is used in powerful fraud management systems in order to detect frauds. The system analyses each customer transaction on the basis of 500 different criteria or aspects to differentiate between a real and a fraudulent transaction.
● This level of analytic scalability needs an MPP system.
● MPP is a widely used database management system for storing and analysing huge volumes of data.
● An MPP database has several independent pieces of data stored on multiple networks of connected computers.
● It eliminates the concept of one central server having a single CPU and disk.
● VISA payment services make use of MPP in its fraud management system.

Use of Big Data in Detecting Fraud in the Insurance Sector
Fig: Use of big data in detecting fraud in the insurance sector
● This is important to study because most cases of cheating and fraudulent activity occur in the insurance and retail sectors.

What is the data available in the insurance sector?
In general, the company offering insurance is always willing to improve its ability to take decisions while processing claims and to ensure that a claim is genuine.
● The company has policies and procedures to help underwriters (officers who evaluate insurance coverage, claim details, etc.); however, underwriters do not always have the required data at the right time to make the necessary decisions, thus delaying processing time and increasing the chances of fraud.
● Before big data, insurance companies used to analyse only a small sample of customer data, with fewer parameters, making the process less foolproof.

How to make optimum use of big data analytics in insurance
● As a solution to these problems, big data analytical platforms increase the availability of data about customers by integrating the companies' internal data with data obtained from social media and other sources.
● Ex: a customer might indicate that his/her car was destroyed in a flood, but documentation from social media may show that the car was actually in another city on the day the flood occurred; this mismatch may hint at the existence of fraud.
● Thus, the information obtained from these platforms enables insurance companies to diagnose customer claim behaviour and other related issues.
● Big data can detect patterns of fraudulent behaviour from the large amounts of structured and unstructured data given to it (ex: bank statements, medical bills, criminal records) and help in detecting frauds more quickly and ensuring better actions.

What are the useful insights from big data analytics in insurance?
● Social network analysis (SNA): a mixed approach using statistical methods, pattern analysis and link analysis to identify any kind of relationship within large amounts of data collected from different sources, for example data from public records such as criminal records, address-change frequency, foreclosures (legal processes in which a bank recovers money from a customer who has defaulted on repayments) and declarations of bankruptcy. These various data sources can be assimilated into the SNA model, which helps to effectively detect the existence of fraud.
● Using this approach of incorporating information obtained from various data sources into a model, the insurance company can rate claims (a high rating indicates that the claim is fraudulent). Ex: if a customer files a claim to get insurance money for a car destroyed in a fire, and sentiment analysis on the customer's statements in the claim report comes across words like "valuable item removed from car", this might indicate that the car was burnt on purpose.

Social customer relationship management:
● Social customer relationship management is not a platform or a technology, but a process. It makes it critical for insurance companies to link social media sites, such as Facebook and Twitter, to their CRM systems.
● When social media is integrated within an organisation, it provides high transparency in various issues related to customers.
Retail Industry

What is big data in the retail industry?
● In recent times, the omni-channel retailing process is a new buzzword. This process focuses on consumer experiences by using all available channels (the word 'omni' means 'all directions'), including mobile, internet, television, showrooms, radio, mail, apps, and many more evolving channels.
● Hence, considering the immense number of transactions prevailing in the omni-channel retail industry across all channels, there is a lot of scope for the use of big data technologies in extracting useful information such as relationship patterns and trends in the sales of products.
● For example: at what time of the year do we sell the maximum number of leggings, and from which channel? Or: design promotional coupons for customers based on their ordering.
● Further, to meet the demands of new customers, retailers are adopting specialized software applications; for example, customers are given information on whether a particular item is in stock in a nearby store or not (Apollo Pharmacy).
● This is where big data analytics comes to use.

How to make optimum use of retail data: RFID technology
● The biggest evolution in automating the process of labelling and tracking retail goods is RFID (radio frequency identification).
● Walmart was the first retailer to implement RFID on its products.
● RFID enables better item tracking by differentiating items that are out of stock from those that are available on the shelf.
● With this technology, the huge volume of transactional data associated with omni-channel retailing can be easily handled, and measures can be taken to enhance customer experiences.

Useful insights from retail data analytics:
Asset management: retail organisations can tag their material handling equipment, such as pallets, vehicles and tools, with RFID in order to trace them at any time and from any location. Readers fixed at specific locations can observe and record all movements of the tagged assets with great accuracy. This information also lessens the time spent on documentation.
Inventory control: RFID data allows manufacturers to track inventory for raw materials, work in progress (WIP) or finished goods (FG). Readers installed on shelves can update inventory automatically and raise alarms in case the need for restocking arises. Further, the readers can be programmed to raise an alarm in case items are removed and placed elsewhere. Even Apollo Pharmacy manages its inventory of available drugs using this technology.
Shipping and receiving: RFID tags can be used to speed up the process of the final shipping of finished goods.
Also, logistics companies like DTDC can differentiate speed-delivery products from normal-delivery ones using RFID tags.
Regulatory compliance: to meet the regulations of agencies such as the FDA (Food and Drug Administration), OSHA (Occupational Safety and Health Administration), etc., manufacturers need to dispatch products such as medicines, regulated drugs, special foods containing preservatives, hazardous chemicals, etc. with updated labels. RFID tags can be used as a labelling system for these goods.
Service and warranty authorisations: RFID tags can hold updated information about repairs and services done on a product. Once a repair or service has been completed, the information can be fed into the RFID tag on the product; thus, if future repairs are required, technicians can access this information without consulting any external database, which helps in reducing calls and time-expensive enquiries into documents.

Introducing Technologies for Handling Big Data: Part-3
● Huge amounts of data from different sources need to be managed properly to derive productive results.
● The astronomical increase in the volume, velocity and variety of data collected from different sources at the same time is forcing organisations to adopt a data analysis strategy that can be used for analysing the entire data set in a very short time.
● This is done with the help of new software programs or applications that do the following:

○ breaking up the given task into sub-tasks
○ surveying the available resources on hand
○ assigning the sub-tasks to the nodes or computing devices that are interconnected via a network
○ finally, collecting the outputs from all sub-tasks

The above applications are based on the concepts of distributed and parallel computing.

Distributed and Parallel Computing for Big Data
● Distributed computing: multiple computing resources are connected in a network, and computing tasks are distributed across these resources.

Characteristics of a distributed system:
● No shared clock
● No shared memory
● Concurrency
● Heterogeneity and loose coupling
Parallel computing: another way to improve the processing capability of a computer system is by adding additional computational resources to it.

In this method, complex computations are divided into sub-tasks that can be handled individually by processing units running in parallel.

In general, organisations use a combination of parallel and distributed techniques to process big data, as sketched below.
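The split/assign/collect cycle described above can be sketched on a single machine with Python's standard multiprocessing module: the Pool plays the role of the interconnected nodes, each sub-task is processed in parallel, and the outputs are collected at the end. The data and chunk count are illustrative.

```python
from multiprocessing import Pool

def count_words(chunk):
    # Sub-task: each worker counts the words in its own slice of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data needs parallel processing"] * 1_000_000
    # Break the task into sub-tasks (one chunk per worker).
    chunks = [lines[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, chunks)  # assign sub-tasks to workers
    print(sum(partials))                          # collect the outputs
```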

Issues in big data handling systems:

● Latency: can be defined as the aggregate delay in the system caused by delays in the completion of individual tasks.
○ Such a delay automatically leads to a slowdown in system performance as a whole; this is often termed system delay.
○ The number of nodes designed into the distributed computing system to process individual tasks determines the level of scalability of the big data system.
○ Thus, implementing distributed and parallel computing methodologies helps in handling latency.
● Load balancing: the sharing of the workload across the various systems in the network to manage the load is known as load balancing.
○ Distributed and parallel computing methodologies make use of the load-balancing feature to handle growing amounts of big data more efficiently and flexibly (a minimal sketch follows).
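A minimal Python sketch of the load-balancing idea: each incoming task is sent to the node currently carrying the least work. The node names and task costs are invented for illustration.

```python
import heapq

# (current_load, node_name): a heap keeps the least-loaded node on top.
nodes = [(0, "node-1"), (0, "node-2"), (0, "node-3")]
heapq.heapify(nodes)

tasks = [5, 3, 8, 2, 7]  # arbitrary task costs
for cost in tasks:
    load, name = heapq.heappop(nodes)          # pick the least-loaded node
    print(f"task({cost}) -> {name}")
    heapq.heappush(nodes, (load + cost, name)) # record its new load
```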
● Virtualization: big data virtualization is the process of creating virtual structures for big data systems, such as the hardware platform, storage devices, operating system, etc., to meet the goals and objectives of big data analytics.
○ Virtualization helps organisations to understand and navigate the flow of information across these physical systems easily.
○ Distributed and parallel computing methodologies make use of virtualisation to segregate the processing and analysis tasks into a systematic framework and minimize errors.

Special techniques of Distributed & Parallel computing:
● Distributed and parallel computing techniques have been around for almost 50 years; initially the technology was used in computer science research to solve complex problems by increasing scalability without investing in massive computing systems.
● Over time, the concepts of distributed and parallel computing have evolved into a number of techniques to process and manage the huge amounts of data produced at high velocity.
● Some of these techniques are described below:

● Cluster or grid computing: a form of parallel computing in which a bunch of computers (often called nodes) are connected through a LAN and used to solve complex operations so that they behave like a single machine.
○ This reduces downtime and provides larger storage capacity.
○ Primarily used in Hadoop.
● Massively Parallel Processing (MPP): primarily used in data warehousing.
○ MPP is a widely used database management system approach for storing and analysing huge volumes of data.
○ An MPP database has several independent pieces of data stored on multiple networks of connected computers. It eliminates the concept of one central server having a single CPU and disk.
○ MPP platform examples are Greenplum and ParAccel (both popular database management companies).
● High-performance computing (HPC): HPC environments are those specially designed for processing floating-point data at high speed. HPC is used in research and business organisations to develop specialized applications where accurate results are more valuable and strategic. Example: pollution-level detection.

https://www.youtube.com/watch?v=bAyrObl7TYE&t=184s

Hadoop – High Availability Distributed Object Oriented Platform: what it is and why it matters
Why Hadoop?
● Over the course of the evolution of big data handling systems, distributed computing environments have been used to process high volumes of data.
● However, the multiple nodes in such an environment may not always cooperate with each other (due to issues such as latency, data-related problems, system delays, etc.), thus leaving a lot of scope for errors.
● In this context, Hadoop evolved as a platform or framework providing an improved programming model, which is used to create and run distributed systems quickly and efficiently with the fewest errors.

What is Hadoop?
"Hadoop is a framework that allows you to first store Big Data in a distributed environment, so that you can process it parallelly."

Hortonworks' definition (Hortonworks is a data software company based in California that developed and supported open source software for big data processing): "An open source software platform for distributed storage that provides the analytical technologies and computational power required to work with large datasets."
■ It can run on an entire cluster instead of one PC.
■ "Distributed storage": a data set is spread across multiple hard drives. If one of them burns down, the data is still reliably stored.
■ "Distributed processing": Hadoop can aggregate data using many CPUs in the cluster.

When to use Hadoop?
● Search – Yahoo, Amazon
● Log processing – Facebook, Yahoo
● Data warehouse – Facebook
● Video and image analysis – New York Times

When not to use Hadoop?
■ Low-latency data access: quick access to small parts of data.
■ Multiple data modification: Hadoop is a better fit only if we are primarily concerned with reading data, not modifying it.
■ Lots of small files: Hadoop is suitable for scenarios where we have few but large files.

Evolution of Hadoop
● In 2003, Doug Cutting (a software designer who invented open-source search technologies) launched the project "Nutch" to handle billions of searches and the indexing of millions of web pages.
● Later, in Oct 2003, Google released its paper on GFS (the Google File System).
● In Dec 2004, Google released its paper on MapReduce.
● In 2005, Nutch used GFS and MapReduce to perform its operations.
● In 2006, Yahoo created Hadoop, based on GFS and MapReduce, with Doug Cutting and his team.
● In 2007, Yahoo started using Hadoop on a 1000-node cluster.
● Later, in Jan 2008, Yahoo released Hadoop as an open source project to the Apache Software Foundation.
● In July 2008, Apache successfully tested a 4000-node cluster with Hadoop.
● In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours, to handle billions of searches and index millions of web pages.
● Moving ahead, in Dec 2011, Apache Hadoop released version 1.0.
● Later, in Aug 2013, version 2.0.6 was available.

Hadoop Ecosystem:
● As we understand, Hadoop is an open source software framework (a set of programs written in Java) that allows for massively parallel computing (allowing big data sets to be stored and spread across multiple servers with little reduction in performance).
● Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
● Thus the Hadoop ecosystem is defined as a platform which provides various services to solve the problems associated with big data.
● There are 4 major services provided:
○ Data processing (tools: MapReduce, YARN)
○ Data storage (tools: HDFS, HBase)
○ Data access (tools: Hive, Pig, Sqoop, etc.)
○ Data management (tools: Oozie, Flume, ZooKeeper, etc.)

Understanding Hadoop Ecosystem:
https://www.youtube.com/watch?v=aReuLtY0YMI

Following are the components that collectively form a Hadoop ecosystem:
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: programming-based data processing
● Spark: in-memory data processing
● PIG, HIVE: query-based processing of data services
● HBase: NoSQL database
● Mahout: machine learning algorithm libraries
● ZooKeeper: managing the cluster
● Oozie: job scheduling

HADOOP ECOSYSTEM Contd.
• HDFS, MapReduce and YARN are the core components of Apache Hadoop, and they form the basic distributed Hadoop framework.
• There are several other Hadoop components that form an integral part of the Hadoop ecosystem, with the intent of enhancing the power of Apache Hadoop in some way or the other: providing better integration with databases, making Hadoop faster, or developing novel features and functionalities.
• In the next few slides, we will discuss some of the eminent Hadoop components used extensively by enterprises and mentioned in our syllabus.
• They are: Mahout, Sqoop, Oozie, Flume, ZooKeeper and HBase.
HDFS – Hadoop Distributed File System
▪ HDFS creates an abstraction: in HDFS, the abstraction means representing the data as blocks of a file rather than as a single file, which simplifies the storage subsystem.
▪ Similar to virtualization, you can see HDFS logically as a single unit for storing Big Data, but actually you are storing your data across multiple nodes in a distributed fashion.
▪ HDFS follows a master-slave architecture.

Fig: HDFS illustration and architecture

In HDFS,
▪ the NameNode is the master node, and
▪ the DataNodes are the slaves.
▪ The NameNode contains the metadata about the data stored in the DataNodes, such as which data block is stored in which DataNode, where the replications of a data block are kept, etc.
▪ The actual data is stored in the DataNodes.

What is YARN (Yet Another Resource Negotiator)?
● Hadoop YARN is a component of the Hadoop ecosystem responsible for allocating cluster resources to the various applications (MapReduce programs) running in a Hadoop cluster.
● It schedules the tasks required to be executed, as per the MapReduce programs, on different cluster nodes.
● It is called the operating system of Hadoop, as it is responsible for managing and monitoring the execution of jobs.
● It is specialized for distributed computing.
● It performs all data processing activities by allocating resources and scheduling tasks.
YARN
● YARN stands for Yet Another Resource Negotiator.
● It was introduced in Hadoop 2, where the resource negotiation part was split out from MapReduce.
● Where HDFS splits up the data storage across your cluster, YARN splits up the computation.
▪ YARN will try to align all nodes to run jobs as efficiently as possible.

MapReduce: a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster (a single-machine sketch of the paradigm follows).
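Hadoop MapReduce programs are normally written in Java, but the paradigm itself is easy to see in a small single-machine Python sketch: a map phase emits (key, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group. The documents here are invented for illustration.

```python
from collections import defaultdict

documents = ["big data big insights", "data drives insights"]

# Map: emit (word, 1) for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the pairs by key (in Hadoop this happens across the cluster).
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the values for each key.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'insights': 2, 'drives': 1}
```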
HIVE:
▪ Hive is an application that runs over the Hadoop framework and provides an SQL-like interface for processing or querying data.
▪ Hive provides this SQL-like interface for working on data stored on Hadoop-integrated systems; that is, it makes your HDFS-stored data look like a SQL database.

SQL is a structured query language used for processing structured and semi-structured data.
In short, Hive transforms the queries into efficient MapReduce or Spark jobs.

▪ Hive is built upon Hadoop: a query for processing data is given in Hive, converted into a MapReduce program, and then processed by Hadoop.
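For illustration, this is roughly what querying Hive from Python can look like, assuming the third-party PyHive client and a reachable HiveServer2; the table and column names are hypothetical. Hive compiles the SQL-like query into MapReduce (or Spark) jobs behind the scenes.

```python
from pyhive import hive  # third-party client; assumes HiveServer2 is running

conn = hive.connect(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Hive turns this SQL-like query into MapReduce/Spark jobs over HDFS data.
cursor.execute("SELECT page, COUNT(*) FROM clickstream GROUP BY page")

for page, hits in cursor.fetchall():
    print(page, hits)

conn.close()
```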

PIG:

▪ Pig is also a platform for creating data processing programs that run on Hadoop. The corresponding scripting language, called Pig Latin, has a SQL-like syntax and can express MapReduce jobs. Without Pig, you would have to do more complex programming in Java; Pig transforms your scripts into something that will run on MapReduce.
PIG vs HIVE

Both components solve similar problems: they make it easy to write MapReduce programs (easier than in Java, that is). Pig was created at Yahoo; Hive is originally from Facebook.

Oozie
● Oozie is an orchestration system for Hadoop jobs.
● Oozie is an open source Java web application available under Apache.
● Oozie is designed to run multistage Hadoop jobs as a single job: an Oozie job.

Oozie In Operation
● Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment.
● It allows multiple complex jobs to be combined and run in a sequential order to achieve a bigger task.
● Oozie detects the completion of tasks through callbacks and polling.
● When Oozie starts a task, it provides a unique callback HTTP URL to the task, which notifies that URL when it is complete.
● If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Features of Apache Oozie
● Oozie allows combining multiple complex jobs to be run in a sequential order to achieve the desired output.
● It is strongly integrated with the Hadoop stack, supporting various jobs like Pig, Hive, Sqoop, etc.
● Further, Oozie is able to manage the existing Hadoop machinery for problems such as load balancing, fail-over, etc.

Workflow in Oozie: Types of Apache Oozie Jobs

The following three types of jobs are common in Oozie −

● Oozie Workflow Jobs − Oozie jobs run on demand. Workflow actions can be different tasks, like Hive tasks, Pig tasks, shell actions, etc.

● Oozie Coordinator Jobs − Oozie jobs run periodically.

● Oozie Bundle − a collection of coordinator jobs managed as a single job.
Sqoop
● The Sqoop component is used for importing data from external sources, such as relational databases and variously structured data marts, into related Hadoop components like HDFS, HBase or Hive.
● Sqoop mainly helps in moving data from an enterprise database to a Hadoop cluster for performing the ETL (extract, transform, load) process.
● It can also be used for exporting data from Hadoop components to external structured data stores.
● It provides the ability to transfer data in parallel, for effective performance and optimal system utilization.
● Sqoop creates fast data copies from an external source into Hadoop.
● It acts as a load balancer by offloading extra storage and processing loads to other devices.

Special features of Sqoop:
● Apache Sqoop undertakes these tasks to integrate bulk data movement between Hadoop and structured databases.
● Sqoop fulfills the growing need to transfer data from the mainframe to HDFS.
FLUME

What is Flume?
● Apache Flume is an efficient, distributed, reliable and fault-tolerant data-ingestion tool.
● Flume is relevant in cases where data is required to be brought from multiple servers immediately into Hadoop; in such cases, the Flume component is used to gather and aggregate large amounts of data.
● It facilitates the streaming of huge volumes of log files from various sources (such as web servers, e.g. Twitter, Facebook, etc.) into the Hadoop Distributed File System (HDFS).

Why Apache Flume?
● Organizations running multiple web services across multiple servers and hosts will generate multitudes of log files on a daily basis. These log files contain information about activities that are required for both auditing and analytical purposes.
● Also, when the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores and provides a steady flow of data between them.

ZOOKEEPER

Introduction
● Apache ZooKeeper is an open source software framework designed to coordinate multiple services in the Hadoop ecosystem.
● Organizing and maintaining a service in a distributed environment is a complicated task.
● ZooKeeper allows developers to focus on core application logic without worrying about the distributed nature of the application.
● ZooKeeper is a distributed coordination service used to manage a large set of hosts.

Features of ZooKeeper contd.
● ZooKeeper is a coordinator; many other tools, such as HDFS and HBase, rely on it.
● It can keep track of which node is up or down, which one is the master node, what workers are available, and many more things.
● The things that can go wrong on a cluster, which ZooKeeper keeps track of, include the master node crashing, a worker crashing, or network trouble where one part of the cluster can no longer see the rest.
● ZooKeeper sits on the side of your system and tries to maintain a consistent picture of the state of the entire distributed system.
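A small sketch of this coordination idea, using the third-party kazoo client for ZooKeeper (it assumes a ZooKeeper server on localhost; the znode path and worker name are hypothetical). An ephemeral node vanishes automatically if the worker that created it dies, which is exactly how tools track which nodes are up.

```python
from kazoo.client import KazooClient  # third-party ZooKeeper client

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Register this worker: the ephemeral znode disappears if the process dies,
# so anyone watching /workers always sees an up-to-date list of live nodes.
zk.ensure_path("/workers")
zk.create("/workers/worker-1", b"alive", ephemeral=True)

print(zk.get_children("/workers"))  # e.g. ['worker-1']
zk.stop()
```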
HBase
HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS).

More on the data storage part of the Hadoop ecosystem: HBase
● HBase is the Hadoop database: an open source, non-relational, distributed, column-oriented database developed as part of Hadoop for the storage of data in HDFS.
● Use HBase when you need random, real-time read/write access to your Big Data.
● With an HBase database, an enterprise can create large tables with millions of rows and columns on commodity hardware machines.
● It is beneficial when large amounts of information are required to be stored, updated and processed at a fast speed.
● While MapReduce enhances big data processing, HBase takes care of its storage and access requirements.

Features of HBase:
● HBase helps programmers to store large quantities of data in such a way that it can be accessed easily and quickly, as and when required.
● It stores data in a compressed format and thus occupies less memory space.
● HBase has low latency, so scanning big datasets becomes easy.

● Compared to relational databases, which are row-oriented, HBase is based on a columnar database: the data of all rows is saved column-wise. This makes it easy to add an additional feature of the dataset, represented by a column.
● In case you have a large volume and variety/diversity of data, it is recommended to use a columnar/column-oriented database.
● HBase is suitable both where the data changes gradually, as with demographic data or IP address data, and for rapidly changing data, such as application logs, clickstream data and in-game usage data.

Why HBase?
● HBase is a specialized file system used in HDFS which is relevant in the following cases:
○ more random read and write access to data;
○ when you want the data to be stored in a more structured fashion;
○ when the velocity of data is very high;
○ when the log data of a website needs to be stored.
Example: Facebook data is stored in HBase.

Basic Blocks of HBase
Apache HBase is a NoSQL, column-oriented, distributed, key-value store and scalable database built on top of HDFS.
HBase is a sub-project of the Apache Hadoop project and is used to provide real-time and random read and write access to your big data.

● NoSQL, which stands for "not only SQL", is an alternative to traditional relational databases.
● In NoSQL, data is placed in tables, and it has a flexible data schema (or no schema at all) to better enable fast changes to applications that undergo continuous updates.
● The major difference between SQL and NoSQL is in their read and write options.

DATA

Aadhar ID | Name   | Age | Personal Phone No. | Office address | Official ID | Official Phone No.
2345      | Sarita | 24  | 23456              | HYD            | 1890        | 2405
1234      | Shanti | 25  | 56789              | BBSR           | 1501        | 3412
2567      | Leela  | 18  | 12367              | MAA            | 1402        | 2820
1256      | Swati  | 45  | 45678              | DEL            | 1678        | 2934

Difference in read and write in SQL vs NoSQL:
● In a relational/SQL/row-oriented database, read and write operations are done row-wise.
● In a non-relational/NoSQL/column-oriented database, read is done row-wise but write is done column-wise.
● In the HBase data model, similar columns are grouped into column families.
● Column families are stored together on disk, which is why HBase is referred to as a column-oriented data store.
● Grouping data with similar access patterns together on disk reduces overall disk access and increases performance.

All the parts of the HBase data model converge into a key-value pair. The sample table above maps onto the model like this:

Row Key | Column Family: {Column Qualifier: Version: Value}
1 | Personal: {'Aadhaar ID': 1383859182496: '2345', 'Name': 1383859182858: 'Sarita', 'Age': 1383859183001: '24', 'Personal phone no': 1383859182915: '23456'}
1 | Official: {'Office address': 13878909878: 'HYD', 'Official ID': 12367890008: '1890', 'Official Phone no': 567890999: '2405'}

If you want only the age of a particular employee, say row 1, the key-value pair can look like this:
1:(Personal:Age:1383859183001) => 24

● Random access to a file means that the computer system can read or write information anywhere in the data file.
● When we say real-time read and write, we mean that we don't need to define the structure/schema of the table when we load data: the structure is created dynamically, in terms of key-value pairs, as and when the data is loaded.

Summary, and how HBase is combined with Hadoop
● Just as Google Bigtable uses the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
● Apache HBase is the Hadoop database: a distributed, column-oriented, scalable big data store.
● Use Apache HBase when you need random, real-time read/write access to your Big Data.
● Its goal is hosting large tables (billions of rows, X millions of columns) atop clusters of commodity hardware.
● Apache HBase is an open source, distributed, non-relational database modelled after Google's Bigtable: A Distributed Storage System for Structured Data.
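The key-value model above maps directly onto the HBase client API. Here is a sketch using the third-party happybase Python client; it assumes an HBase Thrift server on localhost and a pre-created 'employee' table with 'Personal' and 'Official' column families, and the values come from the example table above.

```python
import happybase  # third-party client; talks to HBase via its Thrift server

connection = happybase.Connection("localhost")
table = connection.table("employee")

# Write one row: column names are 'family:qualifier', values are bytes.
table.put(b"1", {
    b"Personal:Name": b"Sarita",
    b"Personal:Age": b"24",
    b"Official:Office address": b"HYD",
})

# Random, real-time read of a single cell: row key + column family:qualifier.
row = table.row(b"1", columns=[b"Personal:Age"])
print(row[b"Personal:Age"])  # b'24'

connection.close()
```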
