Lecture 16
Database Systems
Zubaira Naz
Information Technology University (ITU)
Introduction to Big Data
• What is Big Data?
– Data of very large size
– Collections of huge amounts of data
– Growing exponentially over time
– Large and complex data
– Cannot be handled by a traditional DBMS
Motivation
• Very large volumes of data being collected
– Driven by the growth of the web, social media, and more recently
the Internet of Things
– Web logs were an early source of data
• Analytics on web logs has great value for advertising,
web-site structuring, deciding what posts to show to a user, etc.
• Big Data: differentiated from data handled by
earlier generation databases
– Volume: much larger amounts of data stored
– Velocity: much higher rates of insertions
– Variety: many types of data, beyond relational data
Querying Big Data
• Transaction processing systems that need
very high scalability
– Many applications are willing to sacrifice ACID
properties and other database features if they
can get very high scalability
• Query processing systems that
– Need very high scalability, and
– Need to support non-relational data
Big Data Storage Systems
• Distributed file systems
• Sharding across multiple databases
• Key-value storage systems
• Parallel and distributed databases
Distributed File Systems
• A distributed file system stores data across a large
collection of machines, but provides a single file-system
view
• Highly scalable distributed file system for large data-
intensive applications.
– E.g., 10K nodes, 100 million files
• Provides redundant storage of massive amounts of data on
cheap and unreliable computers
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
• Examples:
– Google File System (GFS)
– Hadoop Distributed File System (HDFS)
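The replication idea above can be sketched in a few lines. This is a toy model, not how GFS or HDFS is actually implemented; the node names, block size, and replication factor are illustrative assumptions.

```python
import random

# Toy model (not GFS/HDFS code): a file is split into fixed-size blocks,
# and each block is copied to REPLICATION distinct nodes, so a single node
# failure does not lose data and reads can fall back to another replica.
BLOCK_SIZE = 8            # bytes per block (tiny, purely for illustration)
REPLICATION = 3           # copies kept of every block

nodes = {name: {} for name in ["node1", "node2", "node3", "node4", "node5"]}
failed = set()            # nodes currently considered down
catalog = {}              # file name -> list of (block_id, [node names])

def write_file(name, data):
    """Split data into blocks and place each block on REPLICATION nodes."""
    placements = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id, block = f"{name}#{i // BLOCK_SIZE}", data[i:i + BLOCK_SIZE]
        homes = random.sample(sorted(nodes), REPLICATION)
        for n in homes:
            nodes[n][block_id] = block
        placements.append((block_id, homes))
    catalog[name] = placements

def read_file(name):
    """Reassemble the file, skipping replicas that live on failed nodes."""
    out = b""
    for block_id, homes in catalog[name]:
        alive = [n for n in homes if n not in failed]
        if not alive:
            raise IOError(f"all replicas of {block_id} are unavailable")
        out += nodes[alive[0]][block_id]
    return out

write_file("web.log", b"2024-01-01 GET /index.html 200")
failed.add("node2")                 # simulate a machine failure
print(read_file("web.log"))         # the file is still readable
```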
Sharding
• Sharding: partition data across multiple databases
• Partitioning is usually done on some partitioning attributes
(also known as partitioning keys or shard keys), e.g., user
ID
– E.g., records with key values from 1 to 100,000 on database 1,
records with key values from 100,001 to 200,000 on database 2,
etc.
• Application must track which records are on which
database and send queries/updates to that database
• Positives: scales well, easy to implement
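A minimal sketch of the range-based sharding described above, assuming two in-memory SQLite databases stand in for the two shards; the key boundaries and the shard_for helper are illustrative, and the point is that the application code itself does the routing.

```python
import sqlite3

# Application-level range sharding, matching the example above:
# keys 1-100,000 live on shard 1, keys 100,001-200,000 on shard 2.
SHARDS = [
    (1, 100_000, sqlite3.connect(":memory:")),
    (100_001, 200_000, sqlite3.connect(":memory:")),
]
for _, _, db in SHARDS:
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def shard_for(user_id):
    """The application must track which keys live on which database."""
    for low, high, db in SHARDS:
        if low <= user_id <= high:
            return db
    raise ValueError(f"no shard covers key {user_id}")

def insert_user(user_id, name):
    db = shard_for(user_id)
    db.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
    db.commit()

def get_user(user_id):
    db = shard_for(user_id)
    return db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()

insert_user(42, "alice")        # routed to shard 1
insert_user(150_000, "bob")     # routed to shard 2
print(get_user(150_000))        # ('bob',)
# A query that spans shards (e.g. counting all users) must hit every shard
# and merge the results in the application -- the transparency drawback
# discussed on the next slide.
```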
Sharding
• Drawbacks:
– Not transparent: application has to deal with routing of
queries, queries that span multiple databases
– When a database is overloaded, moving part of its load
out is not easy
– Chance of failure increases with more databases
• Need to keep replicas to ensure availability, which is more
work for the application
Key Value Storage Systems
• Key-value storage systems store large numbers (billions
or even more) of small (KB-MB) sized records
• Records are partitioned across multiple machines, and
• Queries are routed by the system to the appropriate machine
• Records are also replicated across multiple machines, to
ensure availability even if a machine fails
– Key-value stores ensure that updates are applied to all
replicas, to ensure that their values are consistent
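A minimal sketch of this behaviour, assuming hash partitioning and two replicas per key; the machine list and helper names are made up for illustration and do not correspond to any particular key-value store.

```python
import hashlib

# Records are hash-partitioned across machines, and every put is applied to
# REPLICAS consecutive machines so the value survives a single machine failure.
MACHINES = [dict() for _ in range(4)]   # each dict stands in for one machine
REPLICAS = 2

def owners(key):
    """The system (not the application) decides which machines hold a key."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    first = h % len(MACHINES)
    return [(first + i) % len(MACHINES) for i in range(REPLICAS)]

def put(key, value):
    # The update is applied to all replicas so they remain consistent.
    for m in owners(key):
        MACHINES[m][key] = value

def get(key, down=()):
    # Any live replica can serve the read.
    for m in owners(key):
        if m not in down and key in MACHINES[m]:
            return MACHINES[m][key]
    return None

put("user:42", {"name": "alice", "city": "Lahore"})
primary = owners("user:42")[0]
print(get("user:42", down={primary}))   # still answered by the other replica
```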
Parallel and Distributed Databases
• Parallel databases run on multiple machines (a cluster)
– Developed in 1980s, well before Big Data
• Parallel databases were designed for smaller scale
(10s to 100s of machines)
– Did not provide easy scalability
• Replication used to ensure data availability despite
machine failure
– But the query is typically restarted in the event of a failure
• Restarts may be frequent at very large scale
• Map-reduce systems (coming up next) can continue query
execution, working around failures
Challenges of Big data
• Storage: With vast amounts of data generated daily,
the greatest challenge is storage (especially when the
data is in different formats) within legacy systems.
Unstructured data cannot be stored in traditional
databases.
• Processing: big data processing refers to reading,
transforming, extracting, and formatting useful
information from raw data
• Security is a big concern for organizations. Non-
encrypted information is at risk of theft or damage by
cyber-criminals.
Challenges of Big data
• Scaling Big Data Systems: Database sharding, memory
caching, moving to the cloud and separating read-only and write-active
databases are all effective scaling methods.
• Evaluating and Selecting Big Data Technologies:
Companies are spending millions on new big data technologies, and the
market for such tools is expanding rapidly.
• Big Data Environments
• Real-Time Insights
• Data Validation
Questions
1. How do distributed file systems handle data locality to improve access speed?
2. What are common failure scenarios in distributed file systems, and how are
they mitigated?
3. What are the primary challenges in maintaining data consistency and
transaction integrity in a sharded database environment?
4. How does automated vs. manual sharding differ, and what are the pros and
cons of each approach?
5. What are the main considerations for choosing a hash function in key-value
storage to ensure balanced data distribution?
6. How do key-value stores handle complex data types and relationships, given
their flat storage model?
7. What mechanisms are used to ensure fault tolerance and high availability in
key-value storage systems?
8. How do parallel databases manage query optimization across multiple
processing units?
Ch11: Data Analytics
Overview
• Data analytics: the processing of data to
infer patterns, correlations, or models for
prediction
• Primarily used to make business decisions
– Per individual customer
• E.g., what product to suggest for purchase
– Across all customers
• E.g., what products to manufacture/stock, in what
quantity
• Critical for businesses today
Overview (Cont.)
• Common steps in data analytics
– Gather data from multiple sources into one location
• Data warehouses integrate data from multiple sources into a common schema
• Data often needs to be extracted from source formats, transformed to the
common schema, and loaded into the data warehouse
– Can be done as ETL (extract-transform-load) or ELT (extract-load-
transform); a minimal sketch follows this list
– Generate aggregates and reports summarizing data
• Dashboards showing graphical charts/reports
• Online analytical processing (OLAP) systems allow interactive
querying
• Statistical analysis using tools such as R/SAS/SPSS
– Including extensions for parallel processing of big data
– Build predictive models and use the models for decision making
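The ETL sketch referenced above: the source data, field names, and warehouse schema are made-up assumptions used only to illustrate the extract, transform, and load steps, with a small aggregate query at the end standing in for a report.

```python
import csv, io, sqlite3

# Made-up inline CSV data standing in for two hypothetical source systems.
ORDERS_CSV = "cust_id,product,amount,date\n1, Laptop ,999.0,2024-01-05\n2,Phone,499.0,2024-01-06\n"
CUSTOMERS_CSV = "cust_id,name\n1,alice\n2,bob\n"

warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE sales (
    customer TEXT, product TEXT, amount REAL, sale_date TEXT)""")

def extract(raw_csv):
    """Extract: pull raw rows out of a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(order, customers_by_id):
    """Transform: join with customer data and normalize to the common schema."""
    return (
        customers_by_id[order["cust_id"]]["name"],
        order["product"].strip().lower(),
        float(order["amount"]),
        order["date"],
    )

def load(rows):
    """Load: append the transformed rows into the warehouse table."""
    warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    warehouse.commit()

customers = {c["cust_id"]: c for c in extract(CUSTOMERS_CSV)}
load(transform(o, customers) for o in extract(ORDERS_CSV))

# A summary over the unified data, as a dashboard or OLAP tool might build:
print(warehouse.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product").fetchall())
```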
Overview (Cont.)
• Predictive models are widely used today
– E.g., use customer profile features (e.g., income, age, gender,
education, employment) and a customer's past history to predict the
likelihood of default on a loan
• and use the prediction to make the loan decision (see the sketch
after this slide)
– E.g., use past history of sales (by season) to predict future sales
• And use it to decide what/how much to produce/stock
• And to target customers
• Other examples of business decisions:
– What items to stock?
– What insurance premium to charge?
– To whom to send advertisements?
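A hedged sketch of the loan-default example above, using logistic regression from scikit-learn; the feature set and the tiny training data are entirely made up for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Made-up data: features are [income in thousands, age, years employed];
# label 1 = defaulted on a past loan, 0 = repaid.
X = [
    [25, 23, 1], [40, 30, 5], [90, 45, 20], [30, 25, 2],
    [120, 50, 25], [20, 21, 0], [70, 38, 10], [35, 29, 3],
]
y = [1, 0, 0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)            # build the predictive model

applicant = [[45, 33, 6]]                         # a new loan applicant
p_default = model.predict_proba(applicant)[0][1]  # estimated default probability
print(f"estimated probability of default: {p_default:.2f}")

# The business decision then uses the prediction, e.g. approve only if the
# estimated default probability is below some threshold chosen by the lender.
if p_default < 0.5:
    print("approve loan")
else:
    print("reject loan or require collateral")
```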
Overview (Cont.)
• Machine learning techniques are key to finding
patterns in data and making predictions
• Data mining extends techniques developed by
machine-learning communities to run them on
very large datasets
• The term business intelligence (BI) is a synonym
for data analytics
• The term decision support focuses on reporting
and aggregation
Data Warehousing
• Data sources often store only current data, not
historical data
• Corporate decision making requires a unified view
of all organizational data, including historical data
• A data warehouse is a repository (archive) of
information gathered from multiple sources, stored
under a unified schema, at a single site
– Greatly simplifies querying, permits study of historical
trends
– Shifts decision support query load away from
transaction processing systems
Data Warehousing
OLTP & OLAP
• OLTP (Online Transaction Processing) is an operation-based system
• Administers the day-to-day transactions of an organization
• Includes insertions, updates, and deletions
• Its primary objective is data processing
• OLTP allows multiple users to access and change the same data
concurrently, which can create conflicting situations
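A minimal sketch of one such day-to-day OLTP operation, assuming a hypothetical orders/stock schema in SQLite; the point is that the insert and the update commit together as one short transaction.

```python
import sqlite3

# Illustrative schema only: a single OLTP transaction that records an order
# and updates stock atomically (both changes commit together or not at all).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stock  (item TEXT PRIMARY KEY, quantity INTEGER);
    CREATE TABLE orders (item TEXT, qty INTEGER, customer TEXT);
    INSERT INTO stock VALUES ('laptop', 10);
""")

def place_order(item, qty, customer):
    """One short OLTP transaction: commit on success, roll back on error."""
    try:
        with db:   # the with-block wraps the statements in a transaction
            db.execute("INSERT INTO orders VALUES (?, ?, ?)", (item, qty, customer))
            db.execute("UPDATE stock SET quantity = quantity - ? WHERE item = ?",
                       (qty, item))
    except sqlite3.Error:
        print("transaction rolled back")

place_order("laptop", 2, "alice")
print(db.execute("SELECT quantity FROM stock").fetchone())   # (8,)
```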
OLAP
Online Analytical Processing