Unit 1: Data Warehousing & Data Mining
Why Data Mining?
[Figure: knowledge discovery process. Recoverable labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
Example: A Web Mining Framework
• Web mining usually involves
– Data cleaning
– Data integration from multiple sources
– Warehousing the data
– Data cube construction
– Data selection for data mining
– Data mining
– Presentation of the mining results
– Patterns and knowledge to be used or stored
in a knowledge base
Data Mining in Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Recoverable layers: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
KDD Process: A Typical View from ML and Statistics
Multi-Dimensional View of Data Mining
• Data to be mined
– Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web,
multi-media, graphs & social and information networks
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance,
etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Data Mining Function: (5) Outlier Analysis
• Outlier analysis
– Outlier: A data object that does not comply with the
general behavior of the data
– Noise or exception? One person’s garbage could be
another person’s treasure
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
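To illustrate how outliers can fall out as a by-product of regression analysis, here is a minimal sketch. It fits a least-squares line and flags points whose residual is far from the typical residual. The data, the function names, and the cut-off of 3 median absolute deviations are assumptions chosen for illustration, not part of the course material.

```python
from statistics import mean, median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def regression_outliers(xs, ys, threshold=3.0):
    """Return indices of points whose residual deviates from the median
    residual by more than `threshold` median absolute deviations (MAD)."""
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    if mad == 0:
        return []  # residuals show no spread around the median; nothing is flagged
    return [i for i, r in enumerate(residuals) if abs(r - med) > threshold * mad]

# Hypothetical data: y is roughly 2*x, except for one anomalous reading.
xs = list(range(1, 11))
ys = [2, 4, 6, 8, 10, 12, 60, 16, 18, 20]
flagged = regression_outliers(xs, ys)
print(flagged)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) depends on the application.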
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Evaluation of Knowledge
• Is all mined knowledge interesting?
– One can mine a tremendous number of “patterns” and
knowledge
– Some may fit only certain dimension space (time, location, …)
– Some may not be representative, may be transient, …
• Evaluation of mined knowledge → directly mine only
interesting knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– …
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Boosting the power of discovery in a networked
environment
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided
mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
What is a Data Warehouse?
Data Warehouse: Subject-Oriented
• Organized around major subjects, such as
customer, product, sales
• Focuses on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing
• Provides a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process
Data Warehouse: Integrated
• Constructed by integrating multiple,
heterogeneous data sources
– relational databases, flat files, on-line
transaction records
• Data cleaning and data integration techniques
are applied.
– Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
• E.g., Hotel price: currency, tax, breakfast covered,
etc.
– When data is moved to the warehouse, it is
converted.
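The hotel price example can be sketched as a small conversion step. The two source layouts, the field names, and the exchange rate below are all hypothetical; the point is only that records with different naming, currency, tax, and breakfast conventions are mapped to one warehouse convention when they are moved.

```python
# Hypothetical exchange rates to USD; real rates would come from a reference source.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10}

def normalize(record):
    """Map a source-specific hotel record to one assumed warehouse convention:
    price in USD with tax included, breakfast as an explicit boolean."""
    if record["source"] == "A":
        # Assumed source A: price in EUR, tax already included, breakfast as "Y"/"N".
        price = record["price_eur"] * RATES_TO_USD["EUR"]
        breakfast = record["bkfst"] == "Y"
        name = record["hotel_name"]
    elif record["source"] == "B":
        # Assumed source B: price in USD before tax, separate tax rate, boolean flag.
        price = record["net_price"] * (1 + record["tax_rate"])
        breakfast = record["breakfast"]
        name = record["name"]
    else:
        raise ValueError(f"unknown source: {record['source']!r}")
    return {
        "hotel": name,
        "price_usd_incl_tax": round(price, 2),
        "breakfast_included": breakfast,
    }

raw = [
    {"source": "A", "hotel_name": "Alpha", "price_eur": 90.0, "bkfst": "Y"},
    {"source": "B", "name": "Beta", "net_price": 100.0, "tax_rate": 0.10, "breakfast": False},
]
integrated = [normalize(r) for r in raw]
for row in integrated:
    print(row)
```

After conversion, the two prices are directly comparable, which they were not in the source systems.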
Data Warehouse: Time Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current value data
– Data warehouse data: provides information
from a historical perspective (e.g., past 5-10
years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or
implicitly
– But the key of operational data may or may not
contain an element of time
Data Warehouse: Nonvolatile
• A physically separate store of data transformed
from the operational environment
• Operational update of data does not occur in the
data warehouse environment
– Does not require transaction processing,
recovery, and concurrency control
mechanisms
– Requires only two operations in data
accessing:
• initial loading of data and access of data
OLTP vs. OLAP
                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date,          historical,
                    detailed, flat relational,    summarized, multidimensional,
                    isolated                      integrated, consolidated
usage               repetitive                    ad hoc
access              read/write,                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response
Why a Separate Data Warehouse?
[Figure: data warehouse architecture. Recoverable labels: Operational DBs and other sources; Extract, Transform, Load, Refresh; Monitor & Integrator; Metadata; Data Warehouse; Data Marts; OLAP Server; Serve; Analysis, Query, Reports, Data mining]
• Enterprise warehouse
– collects all of the information about subjects spanning
the entire organization
• Data Mart
– a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, e.g., a marketing data mart
– Independent vs. dependent (sourced directly from the warehouse) data marts
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
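A virtual warehouse can be sketched with SQLite from Python's standard library. The `orders` table and the view name are made up for illustration. SQLite has no materialized views, so the sketch emulates materialization by copying the view's result into a table, which is an assumption of this example rather than a general technique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders (region, amount) VALUES ('east', 10.0), ('east', 15.0), ('west', 7.5);

-- The "virtual warehouse": a summary view computed on demand
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- Emulated materialization: store the view's current result as a table
CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# An operational update arrives after materialization.
conn.execute("INSERT INTO orders (region, amount) VALUES ('east', 5.0)")

live = conn.execute(
    "SELECT total FROM sales_by_region WHERE region = 'east'").fetchone()[0]
stored = conn.execute(
    "SELECT total FROM sales_by_region_mat WHERE region = 'east'").fetchone()[0]
print(live, stored)
```

The plain view always reflects the operational data but is recomputed on every query. The stored copy is fast to read but goes stale until it is refreshed.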
Extraction, Transformation, and Loading (ETL)
• Data extraction
– get data from multiple, heterogeneous, and external
sources
• Data cleaning
– detect errors in the data and rectify them when
possible
• Data transformation
– convert data from legacy or host format to
warehouse format
• Load
– sort, summarize, consolidate, compute views, check
integrity, and build indices and partitions
• Refresh
– propagate the updates from the data sources to the
warehouse
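The five steps above can be sketched as a chain of small functions. Everything here is a simplified assumption: sources are plain lists of dictionaries, the warehouse is an in-memory dictionary of per-product totals, and the cleaning rules are invented for the example.

```python
from collections import defaultdict

def extract(sources):
    """Extraction: gather raw rows from several sources (here, lists of dicts)."""
    return [row for source in sources for row in source]

def clean(rows):
    """Cleaning: rectify errors when possible, drop rows that cannot be fixed."""
    cleaned = []
    for row in rows:
        amount = row.get("amount")
        if amount is None:
            continue  # cannot be rectified: the amount is missing
        if isinstance(amount, str):
            amount = float(amount.strip())  # rectify a number stored as text
        cleaned.append({**row, "amount": amount})
    return cleaned

def transform(rows):
    """Transformation: convert to the assumed warehouse format
    (trimmed, upper-case product code)."""
    return [{"product": r["product"].strip().upper(), "amount": r["amount"]}
            for r in rows]

def load(warehouse, rows):
    """Load: sort and consolidate into per-product totals;
    the dictionary key doubles as a simple index."""
    for r in sorted(rows, key=lambda r: r["product"]):
        warehouse[r["product"]] += r["amount"]
    return warehouse

def refresh(warehouse, sources):
    """Refresh: propagate new source rows through the same pipeline."""
    return load(warehouse, transform(clean(extract(sources))))

warehouse = defaultdict(float)
source_a = [{"product": "tv ", "amount": "100.0"}, {"product": "pc", "amount": None}]
source_b = [{"product": "TV", "amount": 50.0}]
refresh(warehouse, [source_a, source_b])               # initial load
refresh(warehouse, [[{"product": "pc", "amount": 20.0}]])  # later update
print(dict(warehouse))
```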
Metadata Repository
• Metadata is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
– schema, view, dimensions, hierarchies, derived data definitions, data
mart locations and contents
• Operational metadata
– data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
– warehouse schema, view and derived data definitions
• Business data
– business terms and definitions, ownership of data, charging
policies
Chapter 4: Data Warehousing and On-line Analytical Processing
From Tables and Spreadsheets to Data Cubes
[Figure: lattice of cuboids. Recoverable labels: all (0-D apex cuboid); 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)]
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions &
measures
– Star schema: A fact table in the middle
connected to a set of dimension tables
– Snowflake schema: A refinement of star
schema where some dimensional hierarchy is
normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
– Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of
stars, therefore called galaxy schema or fact
constellation
Example of Star Schema
• Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city, state_or_province, country
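A star schema like the one in this example could be declared and queried as in the sketch below, which uses SQLite from Python's standard library. The `_dim` and `_fact` table names and the sample rows are assumptions for illustration. `avg_sales` is treated as a derived measure and left out of the stored columns, which is a design choice of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
    day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
    brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY,
    branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
    city TEXT, state_or_province TEXT, country TEXT);

-- Fact table in the middle: one foreign key per dimension, plus measures
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);

-- Hypothetical sample rows
INSERT INTO time_dim VALUES (1, 15, 'Mon', 1, 'Q1', 2024), (2, 20, 'Sat', 4, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'TV-A', 'Acme', 'TV', 'wholesale'),
                            (2, 'PC-B', 'Acme', 'PC', 'retail');
INSERT INTO branch_dim VALUES (1, 'B1', 'store');
INSERT INTO location_dim VALUES (1, '1 Main St', 'Toronto', 'Ontario', 'Canada');
INSERT INTO sales_fact VALUES (1, 1, 1, 1, 2, 1000.0),
                              (1, 2, 1, 1, 1, 800.0),
                              (2, 1, 1, 1, 3, 1500.0);
""")

# Typical star join: aggregate the measures by attributes of two dimensions.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
for row in rows:
    print(row)
```

Each dimension is reached with a single join from the fact table. In a snowflake schema, normalized dimensions would need further joins.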
Example of Snowflake Schema
• Sales Fact Table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city_key
– city: city_key, city, state_or_province, country
Example of Fact Constellation
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_type
• Sales Fact Table: time_key, item_key, branch_key, …
• Shipping Fact Table: time_key, item_key, shipper_key, from_location, …
Multidimensional Data
[Figure: recoverable labels: Office, Day, Month]
A Sample Data Cube
[Figure: data cube. Recoverable labels: Product (TV, PC, VCR, sum); Country (U.S.A, Canada, Mexico, sum)]
Cuboids Corresponding to the Cube
[Figure: lattice of cuboids. Recoverable labels: all (0-D apex cuboid); 1-D cuboids: product, date, country]
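The cuboids of a cube over product, date, and country are the group-bys over every subset of those dimensions, so n dimensions give 2^n cuboids. The sketch below enumerates them and computes two as examples. The sample rows and the `sales` measure are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "date", "country")

def cuboids(dimensions):
    """All subsets of the dimensions, from the 0-D apex cuboid `()`
    up to the n-D base cuboid containing every dimension."""
    return [combo
            for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]

def aggregate(rows, group_by, measure="sales"):
    """Sum the measure for each combination of values of the group-by dimensions."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in group_by)] += row[measure]
    return dict(totals)

# Hypothetical base data
rows = [
    {"product": "TV", "date": "Q1", "country": "Canada", "sales": 10.0},
    {"product": "PC", "date": "Q1", "country": "Canada", "sales": 5.0},
    {"product": "TV", "date": "Q2", "country": "Mexico", "sales": 7.0},
]

lattice = cuboids(DIMENSIONS)
print(len(lattice))                      # 2**3 cuboids
print(aggregate(rows, ()))               # apex cuboid: the grand total
print(aggregate(rows, ("product",)))     # one of the 1-D cuboids
```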
Typical OLAP Operations
Fig. 3.10 Typical OLAP Operations
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information processing
• supports querying, basic statistical analysis, and
reporting using crosstabs, tables, charts and graphs
– Analytical processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations, slice-dice, drilling,
pivoting
– Data mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models,
performing classification and prediction, and
presenting the mining results using visualization tools
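The basic OLAP operations named under analytical processing (slice, dice, drilling, pivoting) can be sketched over a small list of fact records. The function names, the `units` measure, and the data are assumptions for illustration. A real OLAP server would work on precomputed cuboids rather than scanning raw facts.

```python
from collections import defaultdict

# Hypothetical fact records
facts = [
    {"quarter": "Q1", "country": "Canada", "product": "TV", "units": 10},
    {"quarter": "Q1", "country": "Mexico", "product": "TV", "units": 4},
    {"quarter": "Q2", "country": "Canada", "product": "PC", "units": 6},
    {"quarter": "Q2", "country": "Canada", "product": "TV", "units": 3},
]

def slice_cube(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim] == value]

def dice_cube(facts, allowed):
    """Dice: keep facts whose value lies in the allowed set for each listed dimension."""
    return [f for f in facts if all(f[d] in vals for d, vals in allowed.items())]

def roll_up(facts, dims, measure="units"):
    """Roll-up: sum the measure over every dimension not listed in `dims`.
    Drilling down is the reverse: aggregate again with more dimensions."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[measure]
    return dict(totals)

def pivot(table):
    """Pivot: swap the two axes of a 2-D aggregate keyed by (row, column)."""
    return {(col, row): value for (row, col), value in table.items()}

q1_only = slice_cube(facts, "quarter", "Q1")
canada_tv = dice_cube(facts, {"country": {"Canada"}, "product": {"TV"}})
by_country = roll_up(facts, ("country",))
by_quarter_country = roll_up(facts, ("quarter", "country"))   # drill-down of by_country
by_country_quarter = pivot(by_quarter_country)
print(by_country)
print(by_country_quarter)
```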
From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
OLAP Server Architectures