Digitization Week 3


Evolution of Data Management

Database Management Systems


Early days → DBMS
• Collecting, storing, processing and retrieving data
• Application built on top of file systems

Examples:
• Bank
• Hospital

Redundancies → same info in more than one place


Inconsistencies → different values for the same info

• Plethora of drawbacks
1. Data redundancy and inconsistency
◦ Multiple data formats, duplication in different files
2. Difficulty in accessing data
◦ Need to write a new program to carry out each new task
3. Data isolation
◦ Multiple files and formats
4. Integrity problems
◦ Integrity constraints (e.g., account balance > 0) become “buried” in program
code rather than being stated explicitly
◦ Hard to add new constraints or change existing ones
5. Atomicity of updates
◦ Example: Transfer of funds from one account to another should either be
completed or not happen at all
◦ Failures may leave data in an inconsistent state with partial updates carried
out
6. Concurrent access by multiple users
◦ Needed for performance
◦ Example: Two people reading a balance (e.g., 100) and then withdrawing
money (e.g., 50 each) at the same time (see the sketch after this list)
◦ Uncontrolled concurrent accesses can lead to inconsistencies
7. Security problems
◦ Hard to provide user access to some, but not all, data
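A minimal Python sketch, not from the slides, of the lost-update problem behind items 5 and 6: two clients read the same balance before either writes, so one withdrawal silently disappears. The barrier merely forces the unlucky interleaving for illustration; a DBMS prevents exactly this with transactions and locking.

```python
import threading

balance = 100                   # shared "account record", no locking
barrier = threading.Barrier(2)  # force both reads to happen before any write

def withdraw(amount):
    global balance
    read = balance              # 1. both clients read the balance (100)
    barrier.wait()              # 2. wait until the other client has also read
    balance = read - amount     # 3. write back, overwriting the other update

t1 = threading.Thread(target=withdraw, args=(50,))
t2 = threading.Thread(target=withdraw, args=(50,))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)                  # prints 50, not 0: one withdrawal is lost
```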

Relational Database Management Systems


Following Developments → RDBMS
◦ Designed to address DBMS drawbacks/inefficiencies
◦ Data is stored in the form of tables
◦ Maintains the relationships among tables
◦ Supports large data sizes, distribution, many users, multiple levels of data
security, integrity constraints (see the sketch below), etc.
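A minimal sketch, using Python's built-in sqlite3 as a stand-in RDBMS, of the contrast with drawback 4 above: the integrity constraint and the relationship between tables are stated explicitly in the schema instead of being buried in program code. Table and column names are illustrative.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # enforce cross-table relationships

db.execute("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")
db.execute("""
    CREATE TABLE account (
        id       INTEGER PRIMARY KEY,
        owner_id INTEGER NOT NULL REFERENCES customer(id),  -- relationship
        balance  INTEGER NOT NULL CHECK (balance >= 0)      -- explicit constraint
    )""")

db.execute("INSERT INTO customer VALUES (1, 'Alice')")
db.execute("INSERT INTO account VALUES (10, 1, 100)")

try:
    db.execute("UPDATE account SET balance = -50 WHERE id = 10")
except sqlite3.IntegrityError as err:
    print("rejected by the DBMS:", err)  # CHECK constraint failed
```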
Internet growth
- The example of Wikipedia
• Free content encyclopedia
• Among the popular websites
• Written/maintained: community of volunteer contributors

• Various actions available to the average Web user:


• Create new articles
• Extend existing articles
• Translate to other languages
• Events appear within minutes

Big Data and its challenges

• Reality: ever-increasing data, demanding users

Big Data
• Information assets that require new forms of processing to enable enhanced decision
making and insight discovery
• Definition of Big Data expressed through Vs
1. Volume:
◦ Amount of generated and stored data
◦ Wikipedia: 6.5M English articles, plus users, other languages, etc.
2. Velocity:
◦ Rate/Speed at which the data is generated, received, collected, and
(perhaps) processed
◦ Wikipedia: e.g., 6000 editors make more than 100 edits per month
on the English articles
3. Variety:
◦ Different types of data that are available
◦ RDBMS: structured data that fits neatly into tables
◦ Web systems, e.g., Wikipedia: unstructured and semistructured data types,
such as text, audio, and video
◦ Requires additional preprocessing to derive meaning and support metadata
4. Veracity:
◦ Quality of captured data
◦ Truthfulness of the data and how much we can rely on it
◦ Low veracity → high percentage of meaningless data (e.g., noise)
5. Value:
◦ Refers to the inherent wealth (i.e., economic and social) embedded in the
data
◦ Consider the biggest tech companies: a large part of their value comes from
their data, which they’re constantly analyzing to improve efficiency & develop
new products

Even more Big Data Characteristics


• Visualization:
◦ Display the data
◦ Technical issues due to limitations of in-memory technology, scalability,
response time, etc.
• Volatility:
◦ Everything changes … thus we always need to check whether data is now
irrelevant, historic, or just not useful
• Vulnerability:
◦ New security concerns

Data Integration
• Entities encode a large part of our knowledge
• Valuable asset for numerous current applications and (Web) systems

• A plethora of different objects can have the same name


• Example: London

Entity Resolution
• Task that identifies and aggregates the different descriptions that refer to the
same real-world objects (a minimal sketch follows this list)
• Primary usefulness:
◦ Improves data quality and integrity
◦ Fosters re-use of existing data sources
• Example application domains:
◦ Linked Data
◦ Building Knowledge Graphs
◦ Census data
◦ Price comparison portals
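A minimal sketch, under simplifying assumptions, of the core ER step: score pairs of descriptions with a token-based Jaccard similarity and declare a match above a threshold. Records and threshold are illustrative; real systems use richer similarity functions and blocking to avoid comparing every pair.

```python
import re

def tokens(s: str) -> set:
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# Three descriptions; only the first two refer to the same real-world object.
records = [
    "London, capital of the United Kingdom",
    "London (UK capital)",
    "London, Ontario, Canada",
]

THRESHOLD = 0.25  # illustrative; tuned on labeled data in practice
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = jaccard(records[i], records[j])
        verdict = "match" if score >= THRESHOLD else "non-match"
        print(f"{records[i]!r} vs {records[j]!r}: {score:.2f} -> {verdict}")
```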
Data Management

• Challenges arise from the application settings
• Examples:
◦ Data characteristics
◦ System and resources
◦ Time restrictions
• Evolving nature of the application settings implies a constant modification of
the challenges
• Primary reason for the plethora of Entity Resolution methods

Challenges
Veracity
• Unlike traditional structured data with known semantics and quality, we must
deal with high levels of description noise
+ Volume
• Very large number of descriptions
+ Variety
• Large volumes of semi-structured, unstructured or highly heterogeneous structured
data
+ Velocity
• Increasing rate at which new data becomes available
Challenges in time

Big Data refers to the inherent wealth, economic and social, embedded in any data collection
- Data storage: finding the needle in the haystack
- Data processing: scalability

Architectural choices to consider


• Storage layer
• Programming model & execution engine
• Scheduling
• Optimizations
• Fault tolerance
• Load balancing
Scalability in data management (Chronological order)
Traditional databases
◦ Constrained functionality: SQL only
◦ Efficiency limited by server capacity
- Memory
- CPU (central processing unit)
- HDD (hard disk drive)
- Network
• Scaling can be done by
◦ Adding more hardware
◦ Creating better algorithms
- But we can still reach the limits

Distributed databases
• Innovation:
◦ Add more DBMSs & partition the data
• Constrained functionality:
◦ Answer SQL queries
• Efficiency limited by network, #servers
• API offers location transparency
◦ User/application always sees a single machine
◦ User/application not caring about data location
• Scaling: add more/better servers, faster network
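A minimal sketch of the location transparency described above: rows are hash-partitioned across servers, but callers only see get/put and never learn which server held the data. In-memory dicts stand in for the servers; all names are illustrative.

```python
class DistributedStore:
    """Toy distributed store: hash-partitions keys across 'servers'."""

    def __init__(self, nodes: int):
        self.nodes = [dict() for _ in range(nodes)]  # stand-ins for DBMS servers

    def _node_for(self, key: str) -> int:
        return hash(key) % len(self.nodes)  # partitioning stays internal

    def put(self, key: str, value) -> None:
        self.nodes[self._node_for(key)][key] = value

    def get(self, key: str):
        return self.nodes[self._node_for(key)].get(key)

store = DistributedStore(nodes=4)
store.put("account:42", 100)
print(store.get("account:42"))  # the caller never sees data location
```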

Massively parallel processing platforms


• Innovation:
◦ Connect computers (nodes) over a LAN & make development, parallelization, and
robustness easy
• Functionality:
◦ Generic data-intensive computing
• Efficiency relies on network, #computers, and algorithms
• API offers location & parallelism transparency
◦ Developers don’t need to know where data is stored or how the code will be parallelized
• Scaling:
◦ Add more and/or better computers

Cloud
• Massively parallel processing platforms running over rented hardware
• Innovation: Elasticity, standardization
◦ Amazon requires huge computational capacity near holidays
◦ A university requires very few resources during holidays
• Elasticity can be automatically adjusted
• API offers location and parallelism transparency
• Scaling: it’s magic!
Big Data models
Store, manage, and process Big Data by harnessing large clusters of commodity nodes
• MapReduce family: simpler, more constrained (see the sketch below)

• 2nd generation: enables more complex processing and data, optimization opportunities
- Apache Spark, Google Pregel, Microsoft Dryad
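A minimal local sketch of the MapReduce model referenced above: a word count written as a map function emitting (key, value) pairs and a reduce function aggregating per key. Real platforms distribute both phases, plus the shuffle between them, across the cluster and add fault tolerance.

```python
from collections import defaultdict

def map_phase(doc: str):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

docs = ["big data big clusters", "commodity clusters"]
pairs = [kv for d in docs for kv in map_phase(d)]  # the 'shuffle' is implicit here
print(reduce_phase(pairs))  # {'big': 2, 'data': 1, 'clusters': 2, 'commodity': 1}
```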

Big Data Analytics (according to IBM)


• Driven by artificial intelligence, mobile devices, social media and the Internet of Things
(IoT)
• Data sources are becoming more complex than those for traditional data
◦ e.g., Web applications allow user generated data
- Deliver deeper insights
- Power innovative data applications
- Better and faster decision-making
- Predicting future outcomes
- Enhanced business intelligence

Analytics
• Traditional computation (e.g., SQL):
◦ Exact and all answers over the whole data collection

Interactive processing:
• Users give their opinion during processing
• Thus:
- Users understand the problem
- Users influence decisions
• ER: system users are asked to help during the processing, i.e., their answers are
considered as part of the algorithm

Crowdsourcing processing:
• Difficult tasks or opinion questions in the processing are given to a group of people
• ER: humans are asked about the relation between profiles for a small compensation per
reply

Approximate processing:
• Use a representative sample instead of the entire input data collection
• Give approximate output and not exact answers
• Answers are given with guarantees
• ER: profiles are declared the same with 95% certainty (a minimal sketch follows)
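A minimal sketch of the sampling idea: estimate the mean of a large collection from a random sample and report a rough 95% confidence interval instead of an exact answer. The data and sample size are illustrative.

```python
import random
import statistics

population = [random.gauss(100, 15) for _ in range(1_000_000)]  # "entire" data

sample = random.sample(population, 1_000)  # representative random sample
estimate = statistics.mean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate answer with a guarantee attached, as described above.
print(f"estimated mean: {estimate:.1f} +/- {1.96 * stderr:.1f} (95% confidence)")
```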

Progressive processing:
• Efficiently process the data given the limited time and/or computational
resources currently available
• ER: results are shown as soon as they are available

Incremental processing:
• The rate of data updates is often high, which quickly makes previous results obsolete
• Update existing processing information (a minimal sketch follows)
• Allow leveraging new evidence from updates to:
◦ Fix previous inconsistencies, or
◦ Complete the information
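A minimal sketch of the incremental idea: maintain a running aggregate and fold each update in O(1) instead of recomputing over the whole collection. Names are illustrative; incremental ER would maintain match decisions rather than a simple aggregate.

```python
class RunningMean:
    """Incrementally maintained mean: no full recomputation on update."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value: float) -> None:
        self.count += 1          # fold in one new data item
        self.total += value

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

m = RunningMean()
for v in [100, 50, 75]:          # updates arriving over time
    m.update(v)
    print(f"after {m.count} updates: mean = {m.mean:.1f}")
```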
