
Big Data Analytics

BY
Prof. Srivatsala V
Classification of Digital Data
• Unstructured Data - Data that does not conform to a data model or is not in a form that can be used easily by a computer program.
• Semi-Structured Data - Data that does not conform to a data model but has some structure.
• Structured Data - Data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
Classification of Digital Data
Structured Data
Sources of Structured Data
Ease of working with Structured Data

• Insert/Delete/Update
• Security
• Indexing
• Scalability
• Transaction Processing
• -- ACID properties (Atomicity, Consistency, Isolation, Durability) of transaction processing
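• The ACID guarantees above are easiest to see in action with a relational database. Below is a minimal sketch using Python's built-in sqlite3 module (the "accounts" table and its columns are made up for illustration): a rule violation inside a transaction rolls the whole unit of work back, preserving atomicity and consistency.

    import sqlite3

    # In-memory database for illustration; the "accounts" table is hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
    conn.commit()

    def transfer(conn, src, dst, amount):
        """Move money between accounts as a single atomic transaction."""
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
                conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
                # Simulated consistency rule: balances must never go negative.
                (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
                if balance < 0:
                    raise ValueError("insufficient funds")
        except ValueError:
            pass  # the rollback has already undone the partial update

    transfer(conn, 1, 2, 500)  # fails -> both updates rolled back
    print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 100), (2, 50)]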
Semi-Structured Data
Sources of Semi-Structured Data
• JSON
• XML
• HTML
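• As a quick illustration of why such data is called semi-structured, the sketch below (field names are made up) parses a JSON record with Python's standard json module: the record carries its own tags and nesting, but there is no fixed schema that every record must follow.

    import json

    # A hypothetical customer record; field names are illustrative only.
    raw = '{"id": 42, "name": "Asha", "orders": [{"sku": "B-101", "qty": 2}], "notes": null}'

    record = json.loads(raw)          # self-describing: keys act as tags
    print(record["name"])             # Asha
    print(record["orders"][0]["qty"]) # 2

    # Another record may legally omit or add fields -- no enforced schema.
    other = json.loads('{"id": 43, "name": "Ravi", "loyalty_tier": "gold"}')
    print(other.get("orders", []))    # []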
Unstructured Data

• Examples
Sources of Unstructured Data
Issues with Unstructured data
Dealing with Unstructured Data
• 1. Data mining: Unearthing consistent patterns in large data sets and/or systematic relationships between variables. It is the analysis step of the “knowledge discovery in databases” process.
• A few popular data mining algorithms are as follows:
• • Association rule mining: Also called “market basket analysis” or “affinity analysis”. It is used to determine “what goes with what?”, i.e., when you buy a product, which other product are you likely to purchase with it.
• For example, if you pick up bread from the grocery store, are you likely to pick up eggs or cheese to go with it? (A minimal sketch of the underlying calculation appears after this list.)
• • Regression analysis: It helps to predict the relationship between two variables. The
variable whose value needs to be predicted is called the dependent variable and the
variables which are used to predict the value are referred to as the independent
variables.
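• Here is a minimal, self-contained sketch of the support and confidence calculations behind association rule mining, using a tiny hypothetical basket list (item names are made up):

    # Hypothetical transactions (each basket is a set of purchased items).
    baskets = [
        {"bread", "eggs", "milk"},
        {"bread", "cheese"},
        {"bread", "eggs"},
        {"milk", "cheese"},
    ]

    def support(itemset):
        """Fraction of baskets containing every item in the itemset."""
        return sum(itemset <= b for b in baskets) / len(baskets)

    def confidence(antecedent, consequent):
        """Of the baskets containing the antecedent, how many also contain the consequent?"""
        return support(antecedent | consequent) / support(antecedent)

    bread, eggs = frozenset({"bread"}), frozenset({"eggs"})
    print("support(bread -> eggs):   ", support(bread | eggs))      # 0.5
    print("confidence(bread -> eggs):", confidence(bread, eggs))    # ~0.67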
Dealing with Unstructured Data
• • Collaborative filtering: It is about predicting a user’s preference or preferences based on the preferences of a group of users. For example, refer to Table 1.5: we are looking at predicting whether User 4 prefers to learn using videos or is a textual learner, based on one or a couple of his or her known preferences and those of similar users. (A minimal sketch follows.)
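• A minimal sketch of user-based collaborative filtering, with a made-up ratings table standing in for Table 1.5 (the users, items, and ratings are hypothetical): the unknown preference is predicted from the most similar users who have expressed one.

    # Ratings: 1 = likes the format, 0 = does not; None = unknown.
    ratings = {
        "User 1": {"video": 1, "text": 0},
        "User 2": {"video": 1, "text": 0},
        "User 3": {"video": 0, "text": 1},
        "User 4": {"video": None, "text": 0},  # we want to predict "video"
    }

    def similarity(a, b):
        """Fraction of commonly rated items on which two users agree."""
        common = [i for i in a if a[i] is not None and b[i] is not None]
        if not common:
            return 0.0
        return sum(a[i] == b[i] for i in common) / len(common)

    def predict(user, item):
        """Similarity-weighted vote of other users' known ratings for the item."""
        target = ratings[user]
        num = den = 0.0
        for other, prefs in ratings.items():
            if other == user or prefs[item] is None:
                continue
            w = similarity(target, prefs)
            num += w * prefs[item]
            den += w
        return num / den if den else None

    print(predict("User 4", "video"))  # close to 1 -> User 4 likely prefers videos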
Dealing with Unstructured Data
• 2. Text analytics or text mining: Compared to the structured data stored in relational databases, text is largely unstructured, amorphous, and difficult to deal with algorithmically. Text mining is the process of gleaning high-quality and meaningful information from text (by devising patterns and trends through statistical pattern learning). It includes tasks such as text categorization, text clustering, sentiment analysis, concept/entity extraction, etc.
• 3. Natural language processing (NLP): It is related to the area of human
computer interaction. It is about enabling computers to understand
human or natural language input.
• 4. Noisy text analytics: It is the process of extracting structured or semi-
structured information from noisy unstructured data such as chats, blogs,
wikis, emails, message-boards, text messages, etc.
• The noisy unstructured data usually comprises one or more of the following: spelling mistakes, abbreviations, acronyms, non-standard words, missing punctuation, missing letter case, filler words such as “uh”, “um”, etc. (A minimal normalization sketch follows.)
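• As a small illustration of noisy-text clean-up, the sketch below (the lookup tables are made up; a real system would use far larger resources) normalizes case, expands a few abbreviations, and strips filler words using only the Python standard library:

    import re

    # Tiny, purely illustrative lookup tables.
    ABBREVIATIONS = {"u": "you", "pls": "please", "msg": "message"}
    FILLERS = {"uh", "um", "like"}

    def normalize(text):
        """Lowercase, expand abbreviations, drop fillers, and squeeze whitespace."""
        tokens = re.findall(r"[a-zA-Z']+", text.lower())
        cleaned = []
        for tok in tokens:
            tok = ABBREVIATIONS.get(tok, tok)
            if tok in FILLERS:
                continue
            cleaned.append(tok)
        return " ".join(cleaned)

    print(normalize("Uh, pls send me the msg u got!!"))
    # -> "please send me the message you got"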
Dealing with Unstructured Data
• 5. Manual tagging with metadata: This is about tagging manually with
adequate metadata to provide the requisite semantics to understand
unstructured data.
• 6. Part-of-speech tagging: Also called POS, POST, or grammatical tagging. It is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech such as “noun”, “verb”, “adjective”, etc. (A toy sketch appears after this list.)
• 7. Unstructured Information Management Architecture (UIMA): It is an open-source platform from IBM, used for real-time content analytics, i.e., for processing text and other unstructured data.
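• A deliberately tiny, dictionary-based tagger sketch (the lexicon and the fallback rule are made up for illustration; real taggers such as those in NLTK or spaCy learn these statistically from corpora):

    # Minimal lexicon mapping words to parts of speech; purely illustrative.
    LEXICON = {
        "the": "DET", "a": "DET",
        "dog": "NOUN", "ball": "NOUN",
        "chases": "VERB", "runs": "VERB",
        "red": "ADJ", "quickly": "ADV",
    }

    def pos_tag(sentence):
        """Tag each token via lexicon lookup; unknown words default to NOUN."""
        tags = []
        for word in sentence.lower().split():
            word = word.strip(".,!?")
            tags.append((word, LEXICON.get(word, "NOUN")))
        return tags

    print(pos_tag("The red dog chases a ball quickly."))
    # [('the', 'DET'), ('red', 'ADJ'), ('dog', 'NOUN'), ('chases', 'VERB'),
    #  ('a', 'DET'), ('ball', 'NOUN'), ('quickly', 'ADV')]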
Introduction to Big Data
• Characteristics of Data
• As depicted in Figure 2.1, data has three key characteristics:
• 1. Composition: The composition of data deals with the structure of
data, that is, the sources of data, the granularity, the types, and the
nature of data as to whether it is static or real-time streaming.
• 2. Condition: The condition of data deals with the state of data, that
is, “Can one use this data as is for analysis?” or “Does it require
cleansing for further enhancement and enrichment?”
• 3. Context: The context of data deals with “Where has this data been
generated?” “Why was this data generated?” “How sensitive is this
data?” “What are the events associated with this data?” and so on.
Introduction to Big Data
• Evolution of Big Data
Big Data Definition
• Big data is high-volume, high-velocity, and high-variety
information assets that demand cost effective, innovative
forms of information processing for enhanced insight and
decision making.
Source: Gartner IT Glossary
Definition contd.

Data → Information → Actionable intelligence → Better decisions → Enhanced business value
Challenges with Big Data
• Following are a few challenges with big data:
• 1. Data today is growing at an exponential rate. Most of the data that we have today has
been generated in the last 2–3 years. This high tide of data will continue to rise incessantly.
The key questions here are: “Will all this data be useful for analysis?”, “Do we work with all
this data or a subset of it?”, “How will we separate the knowledge from the noise?”, etc.
• 2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This, however, further complicates the decision to host big data solutions outside the enterprise.
• 3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may become irrelevant and obsolete just a few hours after being generated.
Challenges with Big Data
• 4. There is a dearth of skilled professionals who possess a high level of proficiency
in data sciences that is vital in implementing big data solutions.
• 5. Then, of course, there are other challenges with respect to capture, storage,
preparation, search, analysis, transfer, security, and visualization of big data.
• Big data refers to datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition of how big a dataset should be for it to be considered “big data.” Here we have to deal with data that is just too big, moves way too fast, and does not fit the structures of typical database systems. The data changes are highly dynamic, and therefore there is a need to ingest it as quickly as possible.
• 6. Data visualization is becoming popular as a separate discipline, and there is a marked shortage of business visualization experts.
Challenges with Big Data
What is Big Data?

• Big data is data that is big in volume, velocity, and variety. Refer Figure
2.5.

• Volume
• Velocity
• Variety
What is Big Data?

• Volume
• Sources of Big Data
• 1. Typical internal data sources: Data present within an organization’s firewall.
• • Data storage: File systems, SQL (RDBMSs – Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL
(MongoDB, Cassandra, etc.), and so on.
• • Archives: Archives of scanned documents, paper archives, customer correspondence records, patients’ health
records, students’ admission records, students’ assessment records, and so on.
• 2. External data sources: Data residing outside an organization’s firewall.
• • Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
• 3. Both (internal + external data sources)
• • Sensor data: Car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
• • Machine log data: Event logs, application logs, Business process logs, audit logs, clickstream data, etc.
• • Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• • Business apps: ERP, CRM, HR, Google Docs, and so on.
• • Media: Audio, Video, Image, Podcast, etc.
• • Docs: Comma separated value (CSV), Word Documents, PDF, XLS, PPT, and so on.
• Data volume grows from bits to bytes and on to petabytes and exabytes:
• Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes
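• A small helper (a minimal sketch, assuming decimal units of 1,000 rather than binary units of 1,024) that walks the same ladder of units to express a raw byte count in human-readable form:

    UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

    def human_readable(num_bytes, base=1000):
        """Convert a raw byte count to the largest convenient unit."""
        value = float(num_bytes)
        for unit in UNITS:
            if value < base or unit == UNITS[-1]:
                return f"{value:.2f} {unit}"
            value /= base

    print(human_readable(2.5e18))  # "2.50 EB" -- roughly 2.5 quintillion bytes
    print(human_readable(45e21))   # "45.00 ZB"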
What is Big Data
• Velocity
• Batch → Periodic → Near real time → Real-time processing
• Variety
• Variety deals with a wide range of data types and sources of data. We will study this under three categories: structured data, semi-structured data, and unstructured data.
• 1. Structured data: From traditional transaction processing systems and
RDBMS, etc.
• 2. Semi-structured data: For example Hyper Text Markup Language (HTML),
eXtensible Markup Language (XML).
• 3. Unstructured data: For example unstructured text documents, audios,
videos, emails, photos, PDFs, social media, etc.
Other V’s of Big Data
• 1. Veracity and validity:
• Veracity refers to biases, noise, and abnormality in data. The key question here
is: “Is all the data that is being stored, mined, and analyzed meaningful and
pertinent to the problem under consideration?”
• Validity refers to the accuracy and correctness of the data. Any data that is picked up for analysis needs to be accurate; this is not true of big data alone.
• 2. Volatility: Volatility of data deals with how long the data is valid and how long it should be stored. Some data is required for long-term decisions and remains valid for longer periods of time. However, there are also pieces of data that become obsolete minutes after their generation.
• 3. Variability: Data flows can be highly inconsistent, with periodic peaks. Picture this: an online retailer announces a “big sale day” for a particular week. The retailer is likely to experience an upsurge in customer traffic to the website during that week, and in the same way may experience a slump in business during other periods.
Big Data Analytics
• Need for analytics
• Raw data is collected, classified, and organized. Associating it with adequate metadata
and laying bare the context converts this data into meaningful information. It is then
aggregated and summarized so that it becomes easy to consume it for analysis. Gradual
accumulation of such meaningful information builds a knowledge repository. This, in
turn, helps with actionable insights which prove useful for decision making. Refer Figure
3.1. (next slide).
• Organizations have realized that they cannot ignore big data if they want to remain competitive and make timely decisions that capture fleeting opportunities. They will have to analyze, in a big way, the big data that reaches the organization at unprecedented levels of volume, velocity, and variety.
• Big data analytics is the process of examining big data to uncover patterns, unearth trends, and find unknown correlations and other useful information in order to make faster and better decisions. Analytics begins with analyzing all available data. Refer Figure 3.2.
Big Data Analytics
What is Big Data Analytics?
1. Technology-enabled analytics: Quite a few data analytics and visualization tools are
available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics,
Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.
2. About gaining a meaningful, deeper, and richer insight into your business to steer it in the right direction: understanding the customer’s demographics to cross-sell and up-sell to them, better leveraging the services of your vendors and suppliers, etc.
3. About gaining a competitive edge over your competitors by enabling you with findings that allow quicker and better decision-making.
4. A tight handshake between three communities: IT, business users, and data scientists. Refer
Figure 3.3.
5. Working with datasets whose volume and variety exceed the current storage and processing
capabilities and infrastructure of your enterprise.
6. About moving code to data. This makes perfect sense as the program for distributed
processing is tiny (just a few KBs) compared to the data (Terabytes or Petabytes today and
likely to be Exabytes or Zettabytes in the near future).
Why this Sudden Hype Around Big
Data Analytics?
• If we go by the industry buzz, everywhere there seems to be talk about big data and big data analytics. Why this sudden hype?
• 1. Data is growing at a 40% compound annual rate, reaching nearly 45 ZB by 2020. In 2010, about 1.2 trillion gigabytes of data were generated. This amount doubled to 2.4 trillion gigabytes in 2012 and grew to about 5 trillion gigabytes in 2014. The volume of business data worldwide is expected to double every 1.2 years. Wal-Mart, the world’s largest retailer, processes one million customer transactions per hour. 500 million tweets are posted by Twitter users every day. 2.7 billion “Likes” and comments are posted by Facebook users in a day. Every day 2.5 quintillion bytes of data are created, with 90% of the world’s data created in the past two years alone.
• 2. Cost per gigabyte of storage has hugely dropped.
• 3. There are an overwhelming number of user-friendly analytics tools available in the market
today.
CAP theorem
• The CAP theorem is also called Brewer’s theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all three of the following guarantees simultaneously. Refer Figure 3.14. At best you can have two of the three; one must be sacrificed.
• 1. Consistency
• 2. Availability
• 3. Partition tolerance
CAP theorem
• 1. Consistency implies that every read fetches the last write.
• 2. Availability implies that reads and writes always succeed. In other
words, each non-failing node will return a response in a reasonable
amount of time.
• 3. Partition tolerance implies that the system will continue to function when a network partition occurs.
CAP Theorem
• When to choose consistency over availability and vice-versa…
• 1. Choose availability over consistency when your business requirements
allow some flexibility around when the data in the system synchronizes.
• 2. Choose consistency over availability when your business requirements
demand atomic reads and writes.
• Examples of databases that follow one of the possible three combinations:
• 1. Availability and Partition Tolerance (AP)
• 2. Consistency and Partition Tolerance (CP)
• 3. Consistency and Availability (CA)
• Refer Figure 3.15 to get a glimpse of databases that adhere to two of the
three characteristics of CAP theorem.
Basically Available, Soft State, Eventual Consistency (BASE)
• In distributed computing, BASE is used to achieve high availability.
• Eventual consistency means that if no new updates are made to a given data item for a stipulated period of time, all past updates that have not yet been applied to that item and its several replicas will eventually percolate through, so that the item stays as current/recent as possible.
• A system that has achieved eventual consistency is said to have converged or
achieved replica convergence.
• Conflict resolution: How is the conflict resolved?
• (a) Read repair: If the read leads to discrepancy or inconsistency, a correction is
initiated. It slows down the read operation.
• (b) Write repair: If the write leads to discrepancy or inconsistency, a correction is
initiated. This will cause the write operation to slow down.
• (c) Asynchronous repair: Here, the correction is not part of a read or write operation.
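• A toy sketch of read repair over versioned replicas (the data structures and values are made up for illustration): on every read the freshest version wins, and any stale replica found along the way is corrected.

    # Each replica maps a key to a (version, value) pair; contents are illustrative.
    replicas = [
        {"user:42": (3, "Asha")},
        {"user:42": (2, "Ahsa")},   # stale, misspelled value
        {"user:42": (3, "Asha")},
    ]

    def read_with_repair(key):
        """Return the newest value for key and repair any stale replicas (read repair)."""
        versions = [r[key] for r in replicas if key in r]
        newest = max(versions)          # compares (version, value) tuples
        for r in replicas:
            if r.get(key) != newest:
                r[key] = newest          # asynchronous in real systems; inline here
        return newest[1]

    print(read_with_repair("user:42"))  # "Asha"
    print(replicas[1]["user:42"])       # (3, "Asha") -- stale replica repaired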
Applications of Big Data
• New & Big Data Cases

• Education
• Social media
• Entertainment
• Healthcare
• Marketing
• Customer behavior
• Smart Systems
• Refer to the PPT and Word document sent separately for applications and scenarios of big data.
Big Data Landscape
• NoSQL
• Hadoop
Big Data Landscape - NoSQL (Not Only SQL)
• Few features of NoSQL databases are as follows:
• 1. They are open source.
• 2. They are non-relational.
• 3. They are distributed.
• 4. They are schema-less.
• 5. They are cluster friendly.
• 6. They are born out of 21st century web applications.
• NoSQL databases are widely used in big data and other real-time web
applications.
• Refer Figure 4.1. NoSQL databases are used to store log data, which can then be pulled for analysis.
• Likewise, they are used to store social media data and all such data that cannot be stored and analyzed comfortably in an RDBMS.
• NoSQL stands for Not Only SQL.
• These are non-relational, open source, distributed databases.
• They are hugely popular today owing to their ability to scale out (scale horizontally) and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured.
• 1. Are non-relational: They do not adhere to the relational data model. In fact, they are either key–value pair, document-oriented, column-oriented, or graph-based databases.
• 2. Are distributed: They are distributed meaning the data is distributed across several
nodes in a cluster constituted of low-cost commodity hardware.
• 3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and Durability): They do not offer support for the ACID properties of transactions. On the contrary, they adhere to Brewer’s CAP (Consistency, Availability, and Partition tolerance) theorem and are often seen compromising on consistency in favor of availability and partition tolerance.
• 4. Provide no fixed table schema: NoSQL databases are becoming increasingly popular owing to their schema flexibility. They do not mandate that the data strictly adhere to any schema structure at the time of storage.
Types of NoSQL Databases
• We have already stated that NoSQL databases are non-relational. They can be broadly classified into the following:
• 1. Key–value or the big hash table.
• 2. Schema-less. Refer Figure 4.3.
• Let us take a closer look at key–value and a few other types of schema-less databases:
• 1. Key–value: Maintains a big hash table of keys and values. For example, Dynamo, Redis, Riak, etc. (See the sample key–value pair figure for a key–value NoSQL database.)
• 2. Document: Maintains data in collections constituted of documents. For example, MongoDB.

• 3. Column: Each storage block has data from only one column. For example, Cassandra, HBase, etc.
• 4. Graph: Also called a network database. A graph stores data in nodes. For example, Neo4j, HyperGraphDB, etc. (A toy comparison of these data models follows.)
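• A minimal sketch, using plain Python data structures as stand-ins, of how the same fact ("Asha follows Ravi") might be represented in a key–value, a document, and a graph model (all keys, fields, and names are made up; real stores such as Redis, MongoDB, or Neo4j each have their own APIs):

    # Key-value: one opaque value per key (a big hash table).
    kv_store = {"follows:asha": "ravi"}

    # Document: a self-describing, nested record stored in a collection.
    doc_store = {
        "users": [
            {"_id": 1, "name": "Asha", "follows": ["Ravi"]},
            {"_id": 2, "name": "Ravi", "follows": []},
        ]
    }

    # Graph: explicit nodes and labeled edges between them.
    graph_store = {
        "nodes": {"asha": {"name": "Asha"}, "ravi": {"name": "Ravi"}},
        "edges": [("asha", "FOLLOWS", "ravi")],
    }

    # Column-oriented stores, by contrast, keep all values of one column together,
    # which makes scans over a single attribute cheap.
    print(kv_store["follows:asha"])                           # "ravi"
    print(doc_store["users"][0]["follows"])                   # ["Ravi"]
    print([e for e in graph_store["edges"] if e[0] == "asha"])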
Why NoSQL?
• 1. It has a scale-out architecture instead of the monolithic architecture of relational databases.
• 2. It can house large volumes of structured, semi-structured, and
unstructured data.
• 3. Dynamic schema: NoSQL database allows insertion of data without a pre-
defined schema. In other words, it facilitates application changes in real
time, which thus supports faster development, easy code integration, and
requires less database administration.
• 4. Auto-sharding: It automatically spreads data across an arbitrary number of servers; the application in question is often not even aware of the composition of the server pool. It balances the data and query load across the available servers, and if and when a server goes down, it is quickly replaced without any major disruption of activity.
• 5. Replication: It offers good support for replication which in turn
guarantees high availability, fault tolerance, and disaster recovery.
Advantages of NoSQL
• 1. Can easily scale up and down
• (a) Cluster scale: It allows distribution of the database across 100+ nodes, often in multiple data centers.
• (b) Performance scale: It sustains over 100,000+ database reads and writes
per second.
• (c) Data scale: It supports housing of 1 billion+ documents in the database.
• 2. Doesn’t require a pre-defined schema
• 3. Cheap, easy to implement
• 4. Relaxes the data consistency requirement
• 5. Data can be replicated to multiple nodes and can be partitioned
Advantages of NoSQL
• (a) Sharding: Sharding is when different pieces of data are distributed across
multiple servers. NoSQL databases support auto-sharding; this means that
they can natively and automatically spread data across an arbitrary number
of servers, without requiring the application to even be aware of the
composition of the server pool.
• Servers can be added or removed from the data layer without application
downtime. This would mean that data and query load are automatically
balanced across servers, and when a server goes down, it can be quickly and
transparently replaced with no application disruption.
• (b) Replication: Replication is when multiple copies of data are stored across
the cluster and even across data centers. This promises high availability and
fault tolerance.
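• A toy sketch of hash-based sharding with replication to illustrate the two ideas above (the node names, the hash choice, and the replication factor are all made up; production systems typically use consistent hashing and smarter rebalancing):

    import hashlib

    NODES = ["node-a", "node-b", "node-c", "node-d"]
    REPLICATION_FACTOR = 2  # each key is stored on two nodes for fault tolerance

    def nodes_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
        """Pick the primary shard by hashing the key, then the next rf-1 nodes as replicas."""
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        start = digest % len(nodes)
        return [nodes[(start + i) % len(nodes)] for i in range(rf)]

    for key in ["user:1", "user:2", "order:99"]:
        print(key, "->", nodes_for(key))

    # If node-b is removed, the same function over the remaining nodes re-balances
    # the keys automatically -- the application never names a server directly.
    print("user:1 ->", nodes_for("user:1", nodes=[n for n in NODES if n != "node-b"]))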
Advantages of NoSQL
Disadvantages of NoSQL
Use of NoSQL in Industry
