Lecture 16

Ch10: Big Data

Database Systems
Zubaira Naz
Information Technology University (ITU)
Introduction to Big Data
• What is Big data
– Data with huge size
– Collection of huge data
– Exponentially increasing with time
– Large and complex data
– Cannot be handled by traditional database management systems
Motivation
• Very large volumes of data being collected
– Driven by growth of web, social media, and more
recently internet-of-things
– Web logs were an early source of data
• Analytics on web logs has great value for advertisements,
web site structuring, what posts to show to a user, etc
• Big Data: differentiated from data handled by
earlier generation databases
– Volume: much larger amounts of data stored
– Velocity: much higher rates of insertions
– Variety: many types of data, beyond relational data
Querying Big Data
• Transaction processing systems that need
very high scalability
– Many applications willing to sacrifice ACID
properties and other database features, if they
can get very high scalability
• Query processing systems that
– Need very high scalability, and
– Need to support non-relational data
Big Data Storage Systems
• Distributed file systems
• Sharding across multiple databases
• Key-value storage systems
• Parallel and distributed databases
Distributed File Systems
• A distributed file system stores data across a large
collection of machines, but provides single file-system
view
• Highly scalable distributed file system for large data-
intensive applications.
– E.g., 10K nodes, 100 million files
• Provides redundant storage of massive amounts of data on
cheap and unreliable computers
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Examples:
– Google File System (GFS)
– Hadoop File System (HDFS)
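The replication idea above can be sketched in a few lines of Python. This is a minimal illustration, not the actual placement policy of GFS or HDFS: the node names, the round-robin placement, and the replication factor of 3 are all assumptions made for the example.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        # Start at a different node for each block so load spreads evenly.
        placement[block] = [nodes[(block + i) % len(nodes)]
                            for i in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return [block for block, replicas in placement.items()
            if any(node not in failed_nodes for node in replicas)]


# Hypothetical four-node cluster storing a file split into six blocks.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_blocks(num_blocks=6, nodes=nodes, replication=3)

# With three replicas per block, losing two nodes leaves every block readable.
print(readable_blocks(placement, failed_nodes={"node-a", "node-c"}))
```

Because each block lives on three different nodes, any two simultaneous node failures leave at least one copy of every block; a third failure can make some blocks unreadable.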
Sharding
• Sharding: partition data across multiple databases
• Partitioning is usually done on some partitioning attributes
(also known as partitioning keys or shard keys), e.g., user
ID
– E.g., records with key values from 1 to 100,000 on database 1,
records with key values from 100,001 to 200,000 on database 2,
etc.
• Application must track which records are on which
database and send queries/updates to that database
• Positives: scales well, easy to implement
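The routing that the application has to do can be sketched as follows. The key ranges follow the example above; the third range and the database names are hypothetical, and a real application would hold connections rather than names.

```python
import bisect

# Inclusive upper bound of the key range held by each database.
# The first two ranges come from the example; the third is an assumption.
SHARD_UPPER_BOUNDS = [100_000, 200_000, 300_000]
SHARD_NAMES = ["database_1", "database_2", "database_3"]


def shard_for(user_id):
    """Return the name of the database that stores the given key."""
    if user_id < 1:
        raise KeyError(f"no shard holds key {user_id}")
    # bisect_left finds the first range whose upper bound is >= user_id.
    index = bisect.bisect_left(SHARD_UPPER_BOUNDS, user_id)
    if index == len(SHARD_NAMES):
        raise KeyError(f"no shard holds key {user_id}")
    return SHARD_NAMES[index]


print(shard_for(42))        # key in the first range
print(shard_for(100_001))   # first key of the second range
```

The application must call something like `shard_for` before every query or update, which is why sharding is easy to implement but not transparent.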
Sharding
• Drawbacks:
– Not transparent: application has to deal with routing of
queries, queries that span multiple databases
– When a database is overloaded, moving part of its load
out is not easy
– Chance of failure increases with more databases
• Need to keep replicas to ensure availability, which is more
work for the application
Key Value Storage Systems
• Key-value storage systems store large numbers (billions
or even more) of small (KB-MB) sized records
• Records are partitioned across multiple machines
• Queries are routed by the system to the appropriate machine
• Records are also replicated across multiple machines, to
ensure availability even if a machine fails
– Key-value stores ensure that updates are applied to all
replicas, to ensure that their values are consistent
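A toy in-memory model of these ideas is sketched below. It is not how any particular key-value store is implemented: the class name, the hash-based partitioning, and the choice of two replicas on neighbouring machines are assumptions made for illustration.

```python
import hashlib


class TinyKeyValueStore:
    """In-memory stand-in for a partitioned, replicated key-value store."""

    def __init__(self, num_machines=4, replicas=2):
        self.machines = [dict() for _ in range(num_machines)]
        self.replicas = replicas

    def _machines_for(self, key):
        # A stable hash makes the same key always map to the same machines.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.machines)
        return [(first + i) % len(self.machines) for i in range(self.replicas)]

    def put(self, key, value):
        # Apply the update to every replica so their values stay consistent.
        for machine in self._machines_for(key):
            self.machines[machine][key] = value

    def get(self, key, failed=()):
        # The store, not the application, routes the read to a live replica.
        for machine in self._machines_for(key):
            if machine not in failed:
                return self.machines[machine].get(key)
        raise RuntimeError("all replicas of this key are unavailable")


store = TinyKeyValueStore()
store.put("user:42", "profile data")
primary = store._machines_for("user:42")[0]
# The record is still available when the first machine holding it has failed.
print(store.get("user:42", failed={primary}))
```

Unlike manual sharding, the routing and replication happen inside the store, so the application only calls `put` and `get`.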
Parallel and Distributed Databases
• Parallel databases run on multiple machines (a cluster)
– Developed in 1980s, well before Big Data
• Parallel databases were designed for smaller scale
(10s to 100s of machines)
– Did not provide easy scalability
• Replication used to ensure data availability despite
machine failure
– But a query is typically restarted in the event of a failure
• Restarts may be frequent at very large scale
• Map-reduce systems (coming up next) can continue query
execution, working around failures
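A back-of-the-envelope calculation shows why restarts become frequent at scale. The sketch assumes machines fail independently and that each machine has a 0.1% chance of failing while a query runs; both are illustrative assumptions, not measured figures.

```python
def prob_query_restart(num_machines, p_machine_failure):
    """Probability that at least one machine fails during a query,
    assuming machines fail independently of each other."""
    return 1 - (1 - p_machine_failure) ** num_machines


# Assumed per-machine failure probability during one query: 0.001.
for n in (10, 100, 1_000, 10_000):
    print(n, round(prob_query_restart(n, 0.001), 3))
```

Under these assumptions a restart is needed for roughly 1% of queries on 10 machines, but for about 63% of queries on 1,000 machines, which is why restart-on-failure does not work well at very large scale.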
Challenges of Big data
• Storage: With vast amounts of data generated daily,
the greatest challenge is storage (especially when the
data is in different formats) within legacy systems.
Unstructured data cannot be stored in traditional
databases.
• Processing: big data processing refers to reading,
transforming, extracting, and formatting useful
information from raw data.
• Security: a major concern for organizations. Unencrypted
information is at risk of theft or damage by
cyber-criminals.
Challenges of Big data
• Scaling Big Data Systems: Database sharding, memory
caching, moving to the cloud and separating read-only and write-active
databases are all effective scaling methods.
• Evaluating and Selecting Big Data Technologies:
Companies are spending millions on new big data technologies, and the
market for such tools is expanding rapidly.
• Big Data Environments
• Real-Time Insights
• Data Validation
Questions
1. How do distributed file systems handle data locality to improve access speed?
2. What are common failure scenarios in distributed file systems, and how are
they mitigated?
3. What are the primary challenges in maintaining data consistency and
transaction integrity in a sharded database environment?
4. How does automated vs. manual sharding differ, and what are the pros and
cons of each approach?
5. What are the main considerations for choosing a hash function in key-value
storage to ensure balanced data distribution?
6. How do key-value stores handle complex data types and relationships, given
their flat storage model?
7. What mechanisms are used to ensure fault tolerance and high availability in
key-value storage systems?
8. How do parallel databases manage query optimization across multiple
processing units?
Ch11: Data Analytics
Overview
• Data analytics: the processing of data to
infer patterns, correlations, or models for
prediction
• Primarily used to make business decisions
– Per individual customer
• E.g., what product to suggest for purchase
– Across all customers
• E.g., what products to manufacture/stock, in what
quantity
• Critical for businesses today
Overview (Cont.)
• Common steps in data analytics
– Gather data from multiple sources into one location
• Data warehouses also integrate data into a common schema
• Data often needs to be extracted from source formats, transformed to
common schema, and loaded into the data warehouse
– Can be done as ETL (extract-transform-load), or ELT (extract-load-
transform)
– Generate aggregates and reports summarizing data
• Dashboards showing graphical charts/reports
• Online analytical processing (OLAP) systems allow interactive
querying
• Statistical analysis using tools such as R/SAS/SPSS
– Including extensions for parallel processing of big data
– Build predictive models and use the models for decision making
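The extract-transform-load step and the aggregate step listed above can be sketched with Python's standard library. The two sources, their field names, and the common schema are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical sources that record sales in different formats.
STORE_CSV = (
    "sale_date,product,amount_usd\n"
    "2024-01-15,widget,19.99\n"
    "2024-02-03,gadget,5.00\n"
)
WEB_ROWS = [{"ts": "03/02/2024", "item": "Widget", "cents": 1999}]  # day/month/year


def extract():
    """Read the raw rows from each source."""
    return list(csv.DictReader(io.StringIO(STORE_CSV))), WEB_ROWS


def transform(store_rows, web_rows):
    """Convert both sources to (sale_date, product, amount_usd, source)."""
    unified = []
    for row in store_rows:
        unified.append((row["sale_date"], row["product"].lower(),
                        float(row["amount_usd"]), "store"))
    for row in web_rows:
        day, month, year = row["ts"].split("/")
        unified.append((f"{year}-{month}-{day}", row["item"].lower(),
                        row["cents"] / 100, "web"))
    return unified


def load(conn, rows):
    """Insert the unified rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(sale_date TEXT, product TEXT, amount_usd REAL, source TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))

# Aggregate report over the integrated data.
total_by_product = warehouse.execute(
    "SELECT product, ROUND(SUM(amount_usd), 2) FROM sales "
    "GROUP BY product ORDER BY product"
).fetchall()
print(total_by_product)
```

In an ELT variant the raw rows would be loaded first and the `transform` logic would run inside the warehouse, typically as SQL.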
Overview (Cont.)
• Predictive models are widely used today
– E.g., use customer profile features (e.g. income, age, gender,
education, employment) and past history of a customer to predict
likelihood of default on loan
• and use prediction to make loan decision
– E.g., use past history of sales (by season) to predict future sales
• And use it to decide what/how much to produce/stock
• And to target customers
• Other examples of business decisions:
– What items to stock?
– What insurance premium to change?
– To whom to send advertisements?
Overview (Cont.)
• Machine learning techniques are key to finding
patterns in data and making predictions
• Data mining extends techniques developed by
machine-learning communities to run them on
very large datasets
• The term business intelligence (BI) is a synonym
for data analytics
• The term decision support focuses on reporting
and aggregation
Data Warehousing
• Data sources often store only current data, not
historical data
• Corporate decision making requires a unified view
of all organizational data, including historical data
• A data warehouse is a repository (archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site
– Greatly simplifies querying, permits study of historical
trends
– Shifts decision support query load away from
transaction processing systems
Data Warehousing
OLTP & OLAP

• OLTP and OLAP are both online processing systems.
• OLTP is a transaction processing system, while OLAP is an analytical
processing system.
• OLTP is a system that manages transaction-oriented
applications on the internet, for example, ATMs.
– Day-to-day transactions
• OLAP is an online system that responds to multidimensional
analytical queries such as financial reporting, forecasting, etc.
– Infers information/trends from historical data
OLTP
ONLINE TRANSACTION PROCESSING

• It is an operation-based system
• Administers day-to-day transactions of an organization
• Includes insertions, updates, and deletions
• Primary objective is data processing
• The main emphasis for OLTP systems is on
– very fast query processing
– maintaining data integrity in multi-access environments
– effectiveness measured by number of transactions per second
OLTP: Example

• Two people holding a joint account go to different ATMs
simultaneously, and each tries to withdraw the total amount of money.
– OLTP will make sure the total amount of money withdrawn is never more
than the money deposited in the account.
– Money will be drawn on a first-come, first-served basis
• Other examples: online banking, online ticket booking, adding items to
a shopping cart.
• OLTP systems are optimized for transaction processing rather than data analysis.
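The joint-account rule can be sketched with an in-memory SQLite database. The table layout and the starting balance of 500 are assumptions, and the two withdrawals run one after the other on a single connection; in a real OLTP system the DBMS serializes truly concurrent requests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO account VALUES (1, 500)")
conn.commit()


def withdraw(conn, account_id, amount):
    """Withdraw `amount` only if the balance covers it; return True on success."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            # The balance check and the update happen in one atomic statement.
            "UPDATE account SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
        return cursor.rowcount == 1


# Both account holders try to withdraw the full balance.
first = withdraw(conn, 1, 500)   # arrives first
second = withdraw(conn, 1, 500)  # arrives second
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
print(first, second, balance)
```

Putting the `balance >= ?` condition in the same statement as the update is the design choice that matters here: there is no window between checking the balance and changing it.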
Queries that an OLTP System Can Process
An OLTP system is an online database-modifying system. It therefore supports
database operations such as inserting, updating, and deleting information in the
database.
– Consider a point-of-sale (POS) system of a supermarket; the following are
sample queries that this system can process:
• Retrieving the description of a particular product.
• Filtering all products related to the supplier.
• Searching the record of the customer.
• Listing products having a price less than the expected amount.
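The four sample queries could look like this against a small SQLite database. The table names, columns, and rows are hypothetical and exist only to make the queries runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, description TEXT,
                      supplier TEXT, price REAL);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO product VALUES (1, 'Whole milk 1L', 'DairyCo', 1.20);
INSERT INTO product VALUES (2, 'Cheddar 200g', 'DairyCo', 3.50);
INSERT INTO product VALUES (3, 'Olive oil 500ml', 'OilCo', 6.75);
INSERT INTO customer VALUES (1, 'A. Khan');
""")

# Retrieve the description of a particular product.
description = conn.execute(
    "SELECT description FROM product WHERE id = ?", (1,)).fetchone()[0]

# Filter all products related to a supplier.
by_supplier = [row[0] for row in conn.execute(
    "SELECT description FROM product WHERE supplier = ? ORDER BY id",
    ("DairyCo",))]

# Search for the record of a customer.
customer = conn.execute(
    "SELECT id, name FROM customer WHERE name = ?", ("A. Khan",)).fetchone()

# List products with a price below a given amount.
cheap = [row[0] for row in conn.execute(
    "SELECT description FROM product WHERE price < ? ORDER BY price", (5.0,))]

print(description, by_supplier, customer, cheap, sep="\n")
```

Each query touches only a few rows, which is typical of OLTP workloads.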
OLTP
Disadvantages

• In case of a hardware failure, OLTP systems are severely affected.
– Transactions are affected and may be halted.
• OLTP allows multiple users to access and change the same data,
which often leads to unexpected situations.
OLAP
ONLINE ANALYTICAL PROCESSING

• Provides analysis of data for business decisions.
• Primary objective is not just data processing but also data
analysis.
• OLAP is an online historical multidimensional data retrieval
system, which retrieves the data for analysis that can help in
decision making.
• OLAP enables a user to easily and selectively extract and view
data from different points of view.
• OLAP allows users to analyze database information from
multiple database systems at one time.
OLAP
Example

• Amazon analyzes purchases made by its customers to come up
with a personalized page showing products that each customer is
likely to be interested in.
• A company might compare sales in the month of February, with
the month of January. Then compare those results with another
location, stored in a separate DB.
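The month-over-month comparison across two locations can be sketched as follows. Two in-memory SQLite databases stand in for the separate location databases; the location names and sales figures are made up for the example.

```python
import sqlite3


def make_location_db(rows):
    """Create a separate in-memory database holding one location's sales."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def monthly_totals(conn):
    """Total sales per month (YYYY-MM), computed with a GROUP BY aggregate."""
    query = ("SELECT substr(sale_date, 1, 7), SUM(amount) FROM sales "
             "GROUP BY substr(sale_date, 1, 7)")
    return dict(conn.execute(query).fetchall())


# Hypothetical data for two locations, each in its own database.
locations = {
    "location_a": make_location_db(
        [("2024-01-10", 100), ("2024-01-25", 150), ("2024-02-05", 300)]),
    "location_b": make_location_db(
        [("2024-01-12", 200), ("2024-02-14", 180)]),
}

# Compare February with January at each location, then across locations.
change = {}
for name, conn in locations.items():
    totals = monthly_totals(conn)
    change[name] = totals["2024-02"] - totals["2024-01"]
print(change)
```

Unlike the OLTP queries, each aggregate here scans all of a location's historical rows; an OLAP system is built to make this kind of query interactive.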
OLAP
Advantages

• Creates a single platform for all types of business analytical
needs
• Consistency of information and calculations
• Makes it easy to apply restrictions on users to comply with regulations and
protect sensitive data.
OLAP
Disadvantages

• Implementation and maintenance are difficult and dependent on
IT professionals; the modelling process is complicated
• Needs cooperation between people of various departments to be
effective, which may be difficult to achieve
Assignment
1. Explain the difference between descriptive, predictive, and prescriptive
analytics.
2. Define a data warehouse and explain its key characteristics.
3. What is the difference between a data warehouse and a traditional database?
4. Describe the ETL (Extract, Transform, Load) process in the context of data
warehousing.
5. What are some common applications of data warehousing in business?
6. Compare and contrast star schema and snowflake schema in data warehouse
design.
7. Describe the CRISP-DM methodology used in data mining. What are some
common techniques used in data mining (e.g., classification, clustering,
association rule mining)?
8. What is OLAP, and how does it differ from Online Transaction Processing
(OLTP)?
9. Explain the concept of OLAP cubes and how they enhance data analysis.
10. What are the different types of OLAP (e.g., MOLAP, ROLAP, HOLAP), and
how do they differ?
11. How do slicing, dicing, and pivoting operations work in OLAP?
12. What are some typical use cases of OLAP in business intelligence?
