Digitization Week 3
Examples:
• Bank
• Hospital
Plethora of drawbacks:
1. Data redundancy and inconsistency
◦ Multiple data formats, duplication in different files
2. Difficulty in accessing data
◦ Need to write a new program to carry out each new task
3. Data isolation
◦ Multiple files and formats
4. Integrity problems
◦ Integrity constraints (e.g., account balance > 0) become “buried” in program code rather than being stated explicitly
◦ Hard to add new constraints or change existing ones
5. Atomicity of updates
◦ Example: Transfer of funds from one account to another should either be
completed or not happen at all
◦ Failures may leave data in an inconsistent state with partial updates carried
out
6. Concurrent access by multiple users
◦ Needed for performance
◦ Example: Two people reading a balance (e.g., 100) and then withdrawing money (e.g., 50 each) at the same time
◦ Uncontrolled concurrent accesses can lead to inconsistencies (see the sketch after this list)
7. Security problems
◦ Hard to provide user access to some, but not all, data
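The concurrency drawback (item 6) can be made concrete with a small sketch. The following illustrative Python example (not from the course material) shows the lost-update problem: both clients read the same balance before either one writes its result back, so one withdrawal is silently lost.

```python
# Lost-update sketch for drawback 6: two clients read the same balance,
# then each writes back its own result, so one withdrawal disappears.

balance = 100  # shared account balance

# Step 1: both clients read the balance before either one writes.
read_by_a = balance   # client A sees 100
read_by_b = balance   # client B also sees 100

# Step 2: each client withdraws 50 based on its stale read and writes back.
balance = read_by_a - 50   # balance is now 50
balance = read_by_b - 50   # balance is 50 again, overwriting A's update

print(balance)  # 50, although two withdrawals of 50 from 100 should leave 0
```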
Big Data
• Information assets that require new forms of processing to enable enhanced decision
making and insight discovery
• Definition of Big Data expressed through Vs
1. Volume:
◦ Amount of generated and stored data
◦ Wikipedia: ~6.5M English articles, plus users, other languages, etc.
2. Velocity:
◦ Rate/Speed at which the data is generated, received, collected, and (perhaps) processed
◦ Wikipedia example: about 6,000 editors make more than 100 edits per month on the English articles
3. Variety:
◦ Different types of data that are available
◦ RDBMS: structured data that fits neatly into relational tables
◦ Web systems, e.g., Wikipedia: unstructured and semi-structured data types, such as text, audio, and video
◦ Requires additional preprocessing to derive meaning and support metadata
4. Veracity:
◦ Quality of captured data
◦ Truthfulness of the data and how much we can rely on it
◦ Low veracity → high percentage of meaningless data (e.g., noise)
5. Value:
◦ Refers to the inherent wealth (i.e., economic and social) embedded in the
data
◦ Consider the biggest tech companies: a large part of their value comes from their data, which they constantly analyze to improve efficiency & develop new products
Data Integration
• Entities encode a large part of our knowledge
• Valuable asset for numerous current applications and (Web) systems
Entity Resolution
• Task that identifies and aggregates the different descriptions that refer to the same real-world objects (a minimal sketch follows at the end of this section)
• Primary usefulness:
◦ Improves data quality and integrity
◦ Fosters re-use of existing data sources
• Example application domains:
◦ Linked Data
◦ Building Knowledge Graphs
◦ Census data
◦ Price comparison portals
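As a concrete illustration of the Entity Resolution task, here is a minimal Python sketch. The records, the token-overlap (Jaccard) similarity, and the 0.5 threshold are assumptions made for the example, not a real ER pipeline: descriptions whose attribute values share enough tokens are declared to refer to the same real-world object.

```python
# Minimal entity-resolution sketch (illustrative only): group record
# descriptions that likely refer to the same real-world object using
# token-overlap (Jaccard) similarity between their attribute values.
from itertools import combinations

records = [
    {"id": 1, "name": "Jon Smith", "city": "Athens"},
    {"id": 2, "name": "John Smith", "city": "Athens"},
    {"id": 3, "name": "Maria Rossi", "city": "Rome"},
]

def tokens(record):
    # Tokenize all string-valued attributes of a description.
    return {t.lower() for value in record.values() if isinstance(value, str)
            for t in value.split()}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Compare every pair and keep those above the (assumed) similarity threshold.
matches = [(r1["id"], r2["id"])
           for r1, r2 in combinations(records, 2)
           if jaccard(tokens(r1), tokens(r2)) >= 0.5]

print(matches)  # [(1, 2)] -- records 1 and 2 describe the same person
```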
Data Management
• Challenges arise from the application settings
• Examples:
◦ Data characteristics
◦ System and resources
◦ Time restrictions
• Evolving nature of the application settings implies a constant modification of the challenges
• Primary reason for the plethora of Entity Resolution methods
Challenges
Veracity
• Dealing with high levels of description noise, even in structured data with known semantics and quality
+ Volume
• Very large number of descriptions
+ Variety
• Large volumes of semi-structured, unstructured or highly heterogeneous structured
data
+ Velocity
• Continuously increasing volume of available data
Challenges in time
- Data storage
- Finding the needle in the haystack
- Data processing
- Scalability
Distributed databases
• Innovation:
◦ Add more DBMSs & partition the data
• Constrained functionality:
◦ Answer SQL queries
• Efficiency limited by network, #servers
• API offers location transparency
◦ User/application always sees a single machine
◦ User/application does not need to care about data location
• Scaling: add more/better servers, faster network
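A minimal sketch of the two ideas above, with hypothetical class and method names (this is not a real distributed DBMS API): the data is hash-partitioned across several servers, while the client-facing put/get calls hide which server actually stores each key (location transparency).

```python
# Hypothetical sketch: hash partitioning across servers behind a single API.

class DistributedStore:
    def __init__(self, num_servers):
        # Each "server" is just an in-memory dict here.
        self.servers = [dict() for _ in range(num_servers)]

    def _server_for(self, key):
        # Hash partitioning decides the location; the caller never sees it.
        return self.servers[hash(key) % len(self.servers)]

    def put(self, key, value):
        self._server_for(key)[key] = value

    def get(self, key):
        return self._server_for(key).get(key)

store = DistributedStore(num_servers=3)
store.put("account:42", {"balance": 100})
print(store.get("account:42"))  # the caller never learns which server held it
```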
Cloud
• Massively parallel processing platforms running over rented hardware
• Innovation: Elasticity, standardization
◦ Amazon requires huge computational capacity near holidays
◦ University requires very little resources during holidays
• Elasticity can be automatically adjusted
• API offers location and parallelism transparency
• Scaling: it’s magic!
Big Data models
Store, manage, and process Big Data by harnessing large clusters of commodity nodes
• MapReduce family: simpler, more constrained
• 2nd generation: enables more complex processing and data, with more optimization opportunities
- Apache Spark, Google Pregel, Microsoft Dryad
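To make the MapReduce family concrete, here is a single-machine Python sketch of the programming model (word count, the classic example): map emits (key, value) pairs, the framework groups them by key, and reduce aggregates each group. No actual cluster or framework is involved; the shuffle step is simulated with a dictionary.

```python
# Single-machine sketch of the MapReduce model: map -> shuffle -> reduce.
from collections import defaultdict

documents = ["big data needs big clusters", "data data everywhere"]

def map_phase(doc):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in doc.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

# Shuffle: group all emitted values by key, as the framework would.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)  # {'big': 2, 'data': 3, 'needs': 1, 'clusters': 1, 'everywhere': 1}
```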
Analytics
• Traditional computation (e.g., SQL):
◦ Exact and all answers over the whole data collection
Interactive processing:
• Users give an opinion
• Thus:
- Users understand the problem
- Users influence decisions
• ER: system users are asked to help during the processing, i.e., their answers are
considered as part of the algorithm
Crowdsourcing processing:
• Difficult tasks or opinions in the processing are given to a group of people
• ER: humans are asked about the relation between profiles for a small compensation per
reply
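A minimal sketch of how human answers can become part of the matching algorithm, covering both interactive and crowdsourced processing. The similarity function, the thresholds, and ask_human are stand-ins invented for the example; a real system would issue a paid micro-task instead.

```python
# Human-in-the-loop matching sketch: confident pairs are decided automatically,
# uncertain ones are routed to a human whose answer decides the outcome.

def similarity(a, b):
    # Toy similarity: fraction of shared words between the two descriptions.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def ask_human(a, b):
    # Placeholder for a crowdsourced question such as
    # "Do these two profiles describe the same entity?"
    print(f"[crowd task] same entity? '{a}' vs '{b}'")
    return True  # assume the worker answered "yes" for the demo

def same_entity(a, b, low=0.3, high=0.8):
    score = similarity(a, b)
    if score >= high:
        return True          # confident match, no human needed
    if score <= low:
        return False         # confident non-match
    return ask_human(a, b)   # uncertain: the human answer is part of the algorithm

print(same_entity("jon smith athens", "john smith athens"))
```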
Approximate processing:
• Use a representative sample instead of the entire input data collection
• Give approximate output and not exact answers
• Answers given with guarantees
• ER: profiles are the same with 95% certainty
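As an illustration of approximate processing, the sketch below estimates the mean of a large collection from a random sample and reports a roughly 95% confidence interval instead of an exact answer. The data and the sample size are made up for the example.

```python
# Approximate processing sketch: sample-based estimate with a ~95% interval.
import random
import statistics

random.seed(0)
full_data = [random.gauss(100, 15) for _ in range(100_000)]  # the "whole" collection

sample = random.sample(full_data, 1_000)                     # representative sample
estimate = statistics.mean(sample)
stderr = statistics.stdev(sample) / len(sample) ** 0.5

# 1.96 standard errors gives roughly a 95% confidence interval under normal assumptions.
print(f"mean ~ {estimate:.2f} +/- {1.96 * stderr:.2f} (95% confidence)")
```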
Progressive processing:
• Efficiently process given the limited time and/or computational resources that are currently available
• ER: results are shown as soon as they are available
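A minimal sketch of progressive processing: a generator yields each match as soon as it is found, so a caller with a limited time budget can stop early and still keep the results produced so far. The records, the matching rule, and the tiny time budget are assumptions for the example.

```python
# Progressive processing sketch: emit results as they are found, stop on budget.
import time
from itertools import combinations

def find_matches(records, is_match):
    # Generator: emit each match immediately instead of waiting for the end.
    for a, b in combinations(records, 2):
        if is_match(a, b):
            yield a, b

records = ["jon smith", "john smith", "maria rossi", "m. rossi"]
is_match = lambda a, b: len(set(a.split()) & set(b.split())) > 0

deadline = time.monotonic() + 0.001   # tiny time budget for illustration
results = []
for pair in find_matches(records, is_match):
    results.append(pair)
    if time.monotonic() > deadline:   # stop when the budget is exhausted
        break

print(results)  # whatever was found before time ran out
```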
Incremental processing:
• The rate of data updates is often high, which quickly makes previous results obsolete
• Update existing processing information
• Allow leveraging new evidence from updates to:
- Fix previous inconsistencies, or
- Complete the information
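A minimal sketch of incremental processing in an ER setting: when a new description arrives, it is compared only against existing clusters instead of re-running the whole resolution from scratch. The clustering rule, the threshold, and the sample records are assumptions made for illustration.

```python
# Incremental ER sketch: attach a newly arrived record to the most similar
# existing cluster, reusing previous results rather than recomputing them.

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

clusters = [["jon smith athens"], ["maria rossi rome"]]  # result of a previous run

def add_record(record, clusters, threshold=0.4):
    # Find the most similar existing cluster; start a new one if none is close enough.
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = max(jaccard(tokens(record), tokens(member)) for member in cluster)
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= threshold:
        best.append(record)      # new evidence completes an existing entity
    else:
        clusters.append([record])  # previously unseen entity

add_record("john smith athens", clusters)   # new evidence from an update
print(clusters)  # [['jon smith athens', 'john smith athens'], ['maria rossi rome']]
```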