Unit 1 Big Data
● What does Big Data Analytics mean? Big data analytics refers to the strategy of analyzing large volumes of
data, or big data. This big data is gathered from a wide variety of sources, including social networks, videos,
digital images, sensors, and sales transaction records. The aim in analyzing all this data is to uncover patterns
and connections that might otherwise be invisible, and that might provide valuable insights about the users who
created it. Through this insight, businesses may be able to gain an edge over their rivals and make superior
business decisions.
Data Analytics vs Data Analysis
● Data Analytics is an advanced and broader form of data analysis. It includes data analysis as a sub-component and sets the logical framework based on which the analysis is done. There are many analytics tools in the market, mainly Python, Apache Spark etc.
● Data analysis consists of defining a data investigation, then cleaning and transforming the data to give a meaningful outcome. For analysing the data, tools such as Tableau, Excel etc. are used.

Big data analytics applications enable big data analysts, data scientists, predictive modelers, statisticians and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data.
For example: internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone records, and machine data captured by sensors connected to the internet of things (IoT).

The importance of big data analytics:
Driven by specialized analytics systems and software, as well as high-powered computing systems, big data analytics offers various business benefits.
Structured Data
Structured data depends on the existence of a data model – a model of how data can be stored, processed and accessed. Because of the data model, each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.
Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.
Semi-structured Data
Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure. JSON and XML are common forms of semi-structured data.

Unstructured Data
Unstructured data is information that either does not have a predefined data model or is not organised in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases. Common examples of unstructured data include audio and video files, or data held in NoSQL databases.
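A small illustration of the "self-describing" nature of semi-structured data: a JSON record carries both its values and the tags that name them, with no relational schema declared in advance (the record and its fields below are invented for illustration).

```python
import json

# A semi-structured record: the tags ("id", "name", ...) travel with the data,
# and nested fields need no predefined table schema.
record_text = '{"id": 7, "name": "sensor-A", "readings": [21.5, 22.0], "meta": {"unit": "C"}}'

def field_names(text):
    """Parse a JSON record and return its top-level tag names."""
    return sorted(json.loads(text).keys())
```

Because the markers are embedded in the data itself, two records in the same file may carry different fields, which is exactly what a fixed relational schema forbids.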
Big data analytics technologies and tools:
● Unstructured and semi-structured data types typically don't fit well in traditional data warehouses that are based on relational databases oriented to structured data sets.
● Further, data warehouses may not be able to handle the processing demands posed by sets of big data that need to be updated frequently or even continually, as in the case of real-time data on stock trading, the online activities of website visitors or the performance of mobile applications.
● As a result, many of the organizations that collect, process and analyze big data turn to NoSQL databases, as well as Hadoop and its companion data analytics tools, including:
○ YARN: a cluster management technology and one of the key features in second-generation Hadoop.
○ MapReduce: a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
○ Spark: an open source, parallel processing framework that enables users to run large-scale data analytics applications across clustered systems.
○ Hive: an open source data warehouse system for querying and analyzing large data sets stored in Hadoop files.
○ Kafka: a distributed publish/subscribe messaging system designed to replace traditional message brokers.
○ Pig: an open source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs executed on Hadoop clusters.
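The MapReduce model mentioned above can be sketched in plain Python. This is a toy single-machine simulation, not the Hadoop API: a map step emits (word, 1) pairs, a shuffle step groups pairs by key (the framework does this between map and reduce), and a reduce step sums each group.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

On a real cluster, each phase runs on many nodes in parallel over partitions of the data; the decomposition, however, is exactly this one.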
Exploring the use of Big Data in Business Context: Part-2
● Almost all organisations collect relevant data (either directly or through an agency).
● This data is related to customer feedback, information about supplies and retail, current market trends etc.
● The continuously increasing cost of collecting this information will be just a waste of resources unless some logical conclusion and business insight can be derived from it. This is where big data analytics comes into the picture.
● This will help organisations to reduce cycle time, fulfil orders quickly, and improve forecast accuracy.

AGENDA
● We are going to discuss 4 different areas of big data applications:
○ Use of Big Data in Social Networking
○ Use of Big Data in Preventing Fraudulent Activities
○ Use of Big Data in Detecting Fraudulent Activities in the Insurance Sector
○ Use of Big Data in the Retail Sector
● In each area we will discuss the following aspects:
○ What is the data involved?
○ How to make optimum use of the data?
○ What are the useful insights from analytics of the data?
Use of big data in social networking

A. What is social network data?
● It refers to data generated by people socializing on social media websites such as Twitter, Facebook etc.: comments, status updates, posts, likes etc.
● On a social media website you will find different people constantly adding and updating comments, status, preferences etc.
● The following URL shows the social network data generated per second through various social media: www.internetlivestats.com

B. How to make optimum use of the social networking big data?
● https://youtu.be/JAO_3EvD3DY
● Analysing and mining the large volume of data in social networking sites (comments, status, posts, likes etc.) shows business trends in general with respect to the "wants" and "preferences" of a wide audience.
● If this data can be systematically segregated on the basis of different age groups, locations, gender etc., then an organisation can design products and services specific to people's needs.
● This is called social network analytics.
● Social networking analytics even has advanced applications such as predicting the online reputation of a brand (ex: TripAdvisor) and increasing profitability in business by targeting influential customers.
● In fact, the data generated from social networking analytics enables an organisation to calculate the total revenue a customer can influence, instead of only the direct revenue he himself generates (ex: food bloggers).
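Segregating social data by age group, location or gender, as described above, is at its core a group-by over user records. A minimal sketch (the records and their fields are invented for illustration):

```python
from collections import defaultdict

def segment(posts, key):
    """Group social-network records by a chosen attribute
    (e.g. 'age_group' or 'location')."""
    groups = defaultdict(list)
    for post in posts:
        groups[post[key]].append(post["text"])
    return dict(groups)

# Invented sample records standing in for scraped posts.
posts = [
    {"age_group": "18-25", "location": "HYD", "text": "love these sneakers"},
    {"age_group": "26-35", "location": "BLR", "text": "need a better laptop bag"},
    {"age_group": "18-25", "location": "BLR", "text": "concert tickets please"},
]
```

Each resulting segment can then be mined separately, e.g. for the "wants" of the 18-25 group versus the 26-35 group.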
Preventing Fraudulent Activities
● Fraud can be committed by both words and behaviour intended to deceive the other party, generally to gain an advantage over that party.
● Here, financial frauds are discussed.
● Frauds that occur frequently in financial institutions such as banks and insurance companies and involve any type of monetary transaction are called financial frauds.
● In such frauds, online retailers such as Amazon, eBay and Groupon suffer huge losses, and this is where big data analytics comes to use.

Types of financial frauds
1. Credit card fraud: This type of fraud is very common and is related to the use of credit card facilities.
● It commonly occurs when a fake or stolen card is used in an online transaction; in spite of security checks on the valid owner of the card, such as address verification or the CVV (card verification value) number, fraudsters manage to manipulate loopholes in the system.
2. Exchange or return policy fraud: Occurs when people take advantage of the exchange or return policies offered by online retailers.
● Example: customers returning a product after using it, or reporting non-delivery and later attempting to sell the product online.
● According to consumer goods regulations, once fraud is proved the retailer has to refund the amount to the customer.
● The online retailer can prevent such fraud by charging a restocking fee on returned goods, getting customer signatures on delivery, and tracing customers known to commit such frauds using their transaction patterns.
● This is where big data analytics comes to use. For example, the retailer can study customers' ordering patterns, frequency of change in shipping address, rush orders, sudden huge orders etc.
3. Personal information fraud: This type of fraud occurs when fraudsters obtain the login credentials of customers, change the existing delivery address, and purchase a product using them.
● When the original customer realises this, he or she keeps calling the retailer to refund the amount, as he or she has not made the transaction.
● Example of prevention: a secure OTP acting as a second check after the CVV, or Google Pay introducing an opening security PIN apart from the regular transaction PIN.

How to make optimum use of customer data to prevent fraud
● In order to deal with huge amounts of data and gain meaningful insights to avoid fraud, organisations need to deploy analytics tools that differentiate between real or genuine entries and fraudulent customer entries.
● Organisations also have to upgrade their knowledge about emerging methods of fraud and design the necessary prevention checks.

What are the useful insights from big data analytics in real-time fraud detection?
● Live data matching: organisations can compare live details of customers obtained from different sources to validate their authenticity.
● Ex: in an online transaction, big data could compare the incoming IP address with the geo-data received from the customer's smartphone apps. A valid match between the two confirms the authenticity of the transaction.
● Ex: costly products can have sensors attached to them that transmit their location information. When such products are delivered to customers, the streaming data obtained from the sensors provides a good source of information to trace any fraud.

Image Analytics
Analytical systems that deal with big data are designed to integrate and understand images, videos, text, numbers and all forms of unstructured data to facilitate image analytics. Some examples include facial recognition (smartphones) and position movement analysis (Google Maps).
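The live-data-matching check described earlier can be sketched as comparing two location estimates for the same transaction. The coordinates and the 50 km threshold below are invented for illustration; a production system would use a geolocation service and tuned risk rules.

```python
import math

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def looks_genuine(ip_location, phone_location, max_km=50):
    """Flag a transaction as genuine only if the IP geolocation and the
    smartphone geo-data roughly agree (toy rule, invented threshold)."""
    return distance_km(ip_location, phone_location) <= max_km

hyderabad = (17.38, 78.48)
mumbai = (19.08, 72.88)
```

A mismatch of hundreds of kilometres between the two sources is the kind of inconsistency the slide's "valid match" rule is meant to catch.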
What is the data available in the insurance sector?
● In general, the company offering insurance is always willing to improve its ability to take decisions while processing claims and to ensure that a claim is a genuine one.
● The company has policies and procedures to help underwriters (officers who evaluate insurance coverage, claim details etc.); however, underwriters do not always have the required data at the right time to make the necessary decision, thus delaying the processing time and increasing the chances of fraud.
● Before big data, insurance companies used to analyse only a small sample of customer data, with fewer parameters, making the process less foolproof.
● Big data can detect patterns of fraudulent behaviour from the large amounts of structured and unstructured data given to it (ex: bank statements, medical bills, criminal records etc.) and help in detecting frauds quicker and ensuring better actions.
How to make optimum use of big data analytics in insurance
● As a solution to these problems, big data analytical platforms increase the availability of data about customers by integrating the insurer's internal data with data obtained from social media or other sources.
● Ex: a customer might indicate that his/her car was destroyed in a flood, but documentation from social media may show that the car was actually in another city on the day the flood occurred; this mismatch may hint at the existence of fraud.
● Thus, the information obtained from these platforms will enable insurance companies to diagnose customer claim behaviour and other related issues.

What are the useful insights from big data analytics in insurance?
● Social network analysis: a mixed approach that uses statistical methods, pattern analysis and link analysis to identify relationships within large amounts of data collected from different sources. Ex: data from public records such as criminal records, frequency of address changes, foreclosures (legal processes in which a bank recovers money from a customer who has defaulted on repayments), and declarations of bankruptcy are various data sources that can point to fraud.
● Using this approach of incorporating information obtained from various data sources into a model, the insurance company can rate claims (a high rating indicates that a claim is likely fraudulent). Ex: if a customer files a case to get insurance money for a car destroyed in a fire, and sentiment analysis on the customer's statements in the claim report comes across words like "valuable item removed from car", this might indicate the car was burnt on purpose.

Social customer relationship management:
● Social customer relationship management is not a platform or technology, but a process. It makes it critical for insurance companies to link social media sites, such as Facebook and Twitter, to their CRM systems.
● When social media is integrated within an organisation, it provides high transparency into various issues related to customers.
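A toy version of the claim-rating idea above: scan the claim statement for suspicious phrases and raise the rating. The phrase list and threshold are invented for illustration; real systems use statistical sentiment and link-analysis models rather than keyword matching.

```python
# Invented watch-list of phrases that warrant a closer look at a claim.
SUSPICIOUS_PHRASES = ["valuable item removed", "no receipts", "cash only"]

def claim_rating(statement):
    """Count suspicious phrases in a claim statement; a higher rating
    means the claim deserves closer inspection."""
    text = statement.lower()
    return sum(phrase in text for phrase in SUSPICIOUS_PHRASES)

def flag_for_review(statement, threshold=1):
    """Route a claim to a human underwriter when its rating crosses the threshold."""
    return claim_rating(statement) >= threshold
```

Note that the rating only prioritises claims for review; it does not by itself prove fraud.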
Retail industry

What is big data in the retail industry?
● In recent times, "omni-channel retailing" has become a new buzzword. This process focuses on the consumer experience by using all available channels (the word "omni" means all directions), including mobile, internet, television, showrooms, radio, mail, apps, and many more evolving channels.
● Hence, considering the immense number of transactions prevailing in the omni-channel retail industry across all channels, there is a lot of scope for the use of big data technologies in extracting useful information such as relationship patterns and trends in the sales of products.
● For example: at what time of the year do we sell the maximum number of leggings, and through which channel?
● Design promotional coupons for customers based on their ordering.
● Further, to meet the demands of new customers, retailers are adopting specialized software applications; for example, customers are given information on whether a particular item is in stock in a nearby store or not (Apollo Pharmacy).
● This is where big data analytics comes to use.

How to make optimum use of retail data: RFID technology
● The biggest evolution in automating the process of labelling and tracking retail goods is RFID (radio frequency identification).
● Walmart was the first retailer to implement RFID on its products.
● RFID helps better item tracking by differentiating items that are out of stock from those available on the shelf.
● Readers fixed at specific locations can observe and record all movements of tagged assets with great accuracy. This information also lessens the time needed for documentation.
● With this technology, the huge volume of transactional data associated with omni-channel retailing can be easily handled, and measures can be taken to enhance customer experiences.

Useful insights from retail data analytics:
● Inventory control: RFID data allows manufacturers to track inventory for raw materials, work in progress (WIP) or finished goods (FG). Readers installed on shelves can update inventory automatically and raise alarms in case restocking is required. Further, the readers can be programmed to raise an alarm in case items are removed and placed elsewhere.
● Asset management: retail organisations can tag their material handling equipment, such as pallets, vehicles, and tools, with RFID in order to trace it at any time and from any location. Even Apollo Pharmacy manages its inventory of available drugs using this technology.
● Shipping and receiving: RFID tags can be used to speed up the process of final shipping of finished goods.
● Regulatory compliance: to meet the regulations of agencies such as the FDA (Food and Drug Administration), OSHA (Occupational Safety and Health Administration) etc., manufacturers need to dispatch products such as medicines, regulated drugs, special foods containing preservatives, hazardous chemicals etc. with updated labels. RFID tags can be used as a labelling system for these goods.
● Also, logistics companies like DTDC can differentiate speed-delivery products from normal-delivery ones using RFID tags.
● Service and warranty authorisations: RFID tags can hold updated information about repairs and services done on a product. Once a repair or service has been completed, the information can be fed into the RFID tag on the product; thus, if future repairs are required, technicians can access this information without consulting any external database, which helps in reducing calls and time-expensive enquiries into documents.

Introducing technologies for handling big data: Part-3
● Huge amounts of data from different sources need to be managed properly to derive productive results.
● The astronomical increase in the volume, velocity and variety of data collected from different sources at the same time is forcing organisations to adopt a data analysis strategy that can analyse the entire data set in a very short time.
● This is done with the help of new software programs or applications based on the concepts of distributed and parallel computing. In this method, complex computations are divided into subtasks, which can be handled individually by processing units running in parallel.
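The divide-into-subtasks idea can be sketched with Python's standard thread pool: split the input into chunks, let parallel workers process the chunks, then combine the partial results. This is a single-machine stand-in for what a cluster does across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, workers=4):
    """Divide a big computation into subtasks (chunk sums) handled by
    parallel workers, then combine the partial results."""
    size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))  # each chunk is one subtask
    return sum(partials)                        # combine step
```

The same split/process/combine shape underlies cluster computing, MPP databases and MapReduce, discussed next; only the scale and the failure handling differ.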
● Cluster or grid computing: a form of parallel computing in which a bunch of computers (often called nodes) are connected through a LAN and used to solve complex operations so that they behave like a single machine. This reduces downtime and provides larger storage capacity. It is the model primarily used in Hadoop.
● Massively parallel processing (MPP): primarily used in data warehousing. MPP is a widely used database management architecture for storing and analysing huge volumes of data. An MPP database has several independent pieces of data stored on multiple networks of connected computers; it eliminates the concept of one central server with a single CPU and disk. Example MPP platforms are Greenplum and ParAccel (both popular database management companies).
● High performance computing (HPC): HPC environments are those specially designed for processing floating-point data at high speed. HPC is used in research and business organisations to develop specialized applications where accurate results are especially valuable and strategic. Example: pollution level detection.

https://www.youtube.com/watch?v=bAyrObl7TYE&t=184s
Hadoop - High Availability Distributed Object Oriented Platform: what it is and why it matters

Why Hadoop?
● Over the course of the evolution of big data handling systems, distributed computing environments have been used to process high volumes of data.
● However, the multiple nodes in such an environment may not always cooperate with each other (due to issues such as latency, data-related problems, system delays etc.), thus leaving a lot of scope for errors.
● In this context, Hadoop evolved as a platform or framework providing an improved programming model, which is used to create and run distributed systems quickly and efficiently with the fewest errors.

What is Hadoop?
● "Hadoop is a framework that allows you to first store Big data in a distributed environment, so that you can process it parallelly."
● Hortonworks' (a data software company based in California that developed and supported open source software for big data processing) definition: "An open source software platform for distributed storage that provides the analytical technologies and computational power required to work with large datasets."
■ It can run on an entire cluster instead of one PC.
■ "Distributed storage": a data set is spread across multiple hard drives. If one of them burns down, the data is still reliably stored.
■ "Distributed processing": Hadoop can aggregate data using many CPUs in the cluster.
● Hadoop thus helps solve the problems associated with big data.
Understanding Hadoop Ecosystem:
https://www.youtube.com/watch?v=aReuLtY0YMI
▪ YARN (Yet Another Resource Negotiator): the resource manager of the cluster; YARN will try to align all nodes to run jobs as efficiently as possible.
HIVE:
▪ Hive is an application that runs over the Hadoop framework and provides a SQL-like interface for processing or querying data stored on Hadoop-integrated systems.
▪ That is, it makes your HDFS-stored data look like a SQL database (SQL being the structured query language used for processing structured and semi-structured data).
▪ In short, Hive transforms the queries into efficient MapReduce or Spark jobs: a query for processing the data is written in Hive, converted into a MapReduce program, and then processed by Hadoop.
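To make the "SQL query becomes a MapReduce job" idea concrete, here is a toy single-machine sketch (plain Python, not Hive itself) of what a query like `SELECT dept, COUNT(*) ... GROUP BY dept` compiles down to: a map that emits the grouping key for each row, and a reduce that aggregates each group. The sample table and its columns are invented for illustration.

```python
from collections import defaultdict

# Stand-in for an HDFS-backed table (invented rows).
rows = [
    {"dept": "sales", "name": "Asha"},
    {"dept": "hr", "name": "Ravi"},
    {"dept": "sales", "name": "Leela"},
]

def group_count(table, key):
    """Roughly what SELECT key, COUNT(*) FROM table GROUP BY key becomes:
    map emits (key, 1) per row; reduce sums per key. Here the map and
    shuffle are collapsed into one in-memory pass."""
    counts = defaultdict(int)
    for row in table:
        counts[row[key]] += 1
    return dict(counts)
```

Hive's value is that the user writes only the declarative query; the translation into distributed map and reduce stages is automatic.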
PIG:
▪ Pig is also a platform for creating data processing programs that run on Hadoop. The corresponding scripting language, called Pig Latin, has a SQL-like syntax and can express MapReduce jobs. Without Pig, you would have to do more complex programming in Java; Pig transforms your scripts into something that will run on MapReduce.
PIG vs HIVE
Both components solve similar problems; they make it easy to write MapReduce programs (easier than in Java, that is). Pig was created at Yahoo; Hive is originally from Facebook.

Oozie
● Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed environment.
● It allows combining multiple complex jobs to be run in sequential order to achieve a bigger task.

Oozie in operation
● Oozie detects completion of tasks through callback and polling.
● When Oozie starts a task, it provides the task with a unique callback HTTP URL, and the task notifies that URL when it is complete.
● If the task fails to invoke the callback URL, Oozie can poll the task for completion.

Features of Apache Oozie
● Oozie is strongly integrated with the Hadoop stack, supporting various jobs like Pig, Hive, Sqoop etc.
● Further, Oozie is able to use the existing Hadoop machinery for problems such as load balancing, fail-over, etc.
● The combined jobs can be different tasks, like Hive tasks, Pig tasks, shell actions etc.
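The run-jobs-in-order idea can be sketched as a tiny workflow runner. This is plain Python, nothing Oozie-specific (Oozie workflows are XML-defined DAGs with callbacks, as described above); the sketch only shows the core behaviour: execute actions in sequence and stop the workflow at the first failure.

```python
def run_workflow(actions):
    """Run named actions in order, like a linear workflow;
    stop at the first action that fails (returns False)."""
    completed = []
    for name, action in actions:
        if not action():
            return completed, name   # (finished actions, failed action)
        completed.append(name)
    return completed, None

# Invented demo workflow: the third job is made to fail on purpose.
log = []
workflow = [
    ("ingest", lambda: log.append("ingest") or True),
    ("transform", lambda: log.append("transform") or True),
    ("load", lambda: False),
    ("report", lambda: log.append("report") or True),
]
```

Running it shows that `report` is never reached once `load` fails, which is the control-flow guarantee a scheduler provides.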
FLUME
● Flume is a data ingestion tool.
● Flume is relevant in cases when data is required to be brought from multiple servers immediately into Hadoop.
● In such cases, the Flume component is used to gather and aggregate large amounts of data.
● It facilitates the streaming of huge volumes of log files from various sources (such as the web servers of Twitter, Facebook etc.) into the Hadoop Distributed File System (HDFS).

Why Apache Flume?
● Organizations running multiple web services across multiple servers and hosts will generate multitudes of log files on a daily basis.
● These log files contain information about activities that are required for both auditing and analytical purposes.
● Also, when the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between the data producers and the centralized stores, providing a steady flow of data between them.

ZOOKEEPER
● ZooKeeper is a centralized coordination service for the cluster: it maintains configuration information and naming, and provides distributed synchronization for components such as HBase.
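Returning to Flume's mediator role: absorbing bursts from fast producers while a slower sink drains at its own pace can be modelled with a bounded buffer. This is a toy model, not Flume's API; real Flume channels apply transactional back-pressure rather than dropping events as this sketch does.

```python
from collections import deque

class Channel:
    """A bounded buffer between fast producers and a slower sink,
    loosely in the spirit of a Flume channel."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()
        self.dropped = 0

    def put(self, event):
        """Accept an event from a producer, or drop it if the buffer is full
        (toy policy only; real Flume blocks the producer instead)."""
        if len(self.buffer) < self.capacity:
            self.buffer.append(event)
        else:
            self.dropped += 1

    def drain(self, batch_size):
        """Deliver up to batch_size buffered events to the destination."""
        n = min(batch_size, len(self.buffer))
        return [self.buffer.popleft() for _ in range(n)]

# Demo: a burst of 5 events into a channel that holds only 3.
ch = Channel(capacity=3)
for event in range(5):
    ch.put(event)
```

The buffer is exactly what lets the destination consume at a steady rate even when the incoming rate spikes.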
More on the data storage part of the Hadoop ecosystem: HBASE
● HBase is the Hadoop database: an open source, non-relational, distributed, column-oriented database developed as part of Hadoop, storing its data in HDFS.
● Use HBase when you need random, real-time read/write access to your big data.
● With the HBase database, an enterprise can create large tables with millions of rows and columns on commodity hardware machines.
● It is beneficial when large amounts of information are required to be stored, updated and processed at a fast speed.
● While MapReduce enhances big data processing, HBase takes care of its storage and access requirements.

Features of HBase:
● HBase helps programmers store large quantities of data in such a way that it can be accessed easily and quickly, as and when required.
● HBase has low latency, so scanning big datasets becomes easy.
NoSQL and the HBase data model
● NoSQL, which stands for "not only SQL", is an alternative to traditional relational databases.
● In NoSQL, data is stored with a flexible data schema - or no schema at all - to better enable fast changes to applications that undergo continuous updates.
● In a relational/SQL/row-oriented database, read and write operations are done row-wise, as in this sample table:

Aadhaar ID | Name   | Age | Personal Phone No. | Office address | Official ID | Official Phone No.
2345       | Sarita | 24  | 23456              | HYD            | 1890        | 2405
1234       | Shanti | 25  | 56789              | BBSR           | 1501        | 3412
2567       | Leela  | 18  | 12367              | MAA            | 1402        | 2820

● In the HBase data model, similar columns are grouped into column families. Column families are stored together on disk, which is why HBase is referred to as a column-oriented data store.
● The structure is created dynamically in terms of key-value pairs, as and when the data is loaded, with each cell carrying a timestamp. Sarita's row, for example, becomes:

Personal: {'Aadhaar ID': 1383859182496:'2345',
           'Name': 1383859182858:'Sarita',
           'Age': 1383859183001:'24',
           'Personal phone no': 1383859182915:'23456'}
Official: {'Office address': 13878909878:'HYD',
           'Official ID': 12367890008:'1890',
           'Official Phone No.': …}

● As can be seen, if you want only the age of a particular employee (say row 1), the store needs to read only the relevant column family rather than the entire row.
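The column-family layout above can be modelled as nested key-value maps: row key → column family → column → (timestamp, value). This is a toy in-memory model for intuition only, not the HBase client API; the timestamps reuse the sample values from the table above.

```python
# row key -> column family -> column -> (timestamp, value)
table = {
    "row1": {
        "Personal": {
            "Name": (1383859182858, "Sarita"),
            "Age": (1383859183001, "24"),
        },
        "Official": {
            "Office address": (13878909878, "HYD"),
        },
    }
}

def get_cell(table, row, family, column):
    """Random read of a single cell: only one column family is touched,
    never the whole row."""
    ts, value = table[row][family][column]
    return value
```

Because each family is its own map (and, in real HBase, its own set of files on disk), reading `Personal:Age` never pays the cost of loading the `Official` family.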
Summary, and how HBase is combined with Hadoop
● Just as Google Bigtable uses the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
● Apache HBase is the Hadoop database: a distributed, column-oriented, scalable big data store.
● Use Apache HBase when you need random, real-time read/write access to your big data.
● Its goal is hosting very large tables - billions of rows by millions of columns - atop clusters of commodity hardware.
● Apache HBase is an open source, distributed, non-relational database modelled after Google's Bigtable: A Distributed Storage System for Structured Data.