02-Introduction to Data Mining
Jiawei Han
Overview
◼ Why Data Mining?
◼ Summary
Why Data Mining?
Evolution of Sciences
component.
• Theoretical models often motivate experiments and generalize our understanding.
simulation.
◼ It grew out of our inability to find closed-form
◼ 1990-now, data science
◼ The flood of data from new scientific instruments and simulations
◼ The ability to economically store and manage petabytes of data online
◼ The Internet and computing Grid that makes all these
Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web databases
◼ 2000s:
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems
What Is Data Mining?
Knowledge Discovery (KDD) Process
◼ This is a view from typical database systems and data warehousing communities
◼ Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Labels, top to bottom: Pattern Evaluation, Data Mining, Task-relevant Data, Data Cleaning, Data Integration, Databases]
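The stages listed above can be sketched as a chain of small functions. This is only a toy illustration: the record layout, the helper names (clean, integrate, select, mine, evaluate), and the data are all hypothetical, and the "mining" step is just a per-customer total standing in for a real algorithm.

```python
# Toy walk-through of the KDD stages. All names and data are hypothetical.

def integrate(*sources):
    """Data integration: merge several sources into one list of records."""
    return [record for source in sources for record in source]

def clean(records):
    """Data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def select(records, region):
    """Selection: keep only the task-relevant records."""
    return [r for r in records if r["region"] == region]

def mine(records):
    """Data mining: total spending per customer (a stand-in 'pattern')."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def evaluate(patterns, min_total):
    """Pattern evaluation: keep only patterns that pass a threshold."""
    return {c: t for c, t in patterns.items() if t >= min_total}

store_db = [
    {"customer": "ann", "region": "east", "amount": 40},
    {"customer": "bob", "region": "east", "amount": None},
    {"customer": "cy", "region": "west", "amount": 90},
]
web_db = [
    {"customer": "ann", "region": "east", "amount": 70},
    {"customer": "bob", "region": "east", "amount": 15},
]

cleaned = clean(integrate(store_db, web_db))
relevant = select(cleaned, region="east")
knowledge = evaluate(mine(relevant), min_total=100)
print(knowledge)  # {'ann': 110}
```

Each stage takes the previous stage's output, so a stage can be swapped or tuned without touching the others.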
Example: A Web Mining Framework
Data Mining in Business Intelligence
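[Figure: layers of data mining in business intelligence. The potential to support business decisions increases toward the top. Layers include, top to bottom: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]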
KDD Process: A Typical View from ML and Statistics
Example: Medical Data Mining
Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented,
levels
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP),
Data Mining: On What Kinds of Data?
◼ Database-oriented data sets and applications
◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structure data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia databases
◼ Text databases
◼ The World-Wide Web
Data Mining Function: (1) Generalization
Data Mining Function: (2) Association and Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ What items are frequently purchased together in your Walmart?
◼ Association, correlation vs. causality
◼ A typical association rule
◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ Are strongly associated items also strongly correlated?
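To make the two numbers attached to a rule such as Diaper → Beer concrete, here is a minimal sketch that computes support and confidence on a made-up set of eight transactions. It also computes lift, one common way to ask whether associated items are correlated; lift is an assumption of this sketch, not something the slide names. The toy data is chosen so that confidence comes out at 75%, but its support (37.5%) is far higher than the slide's 0.5%.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the consequent's base rate.

    A value above 1 suggests positive correlation, below 1 negative.
    """
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))

# Hypothetical market-basket data.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diaper", "beer", "chips"},
    {"milk"},
    {"bread", "chips"},
]

rule_support = support(transactions, {"diaper", "beer"})
rule_confidence = confidence(transactions, {"diaper"}, {"beer"})
rule_lift = lift(transactions, {"diaper"}, {"beer"})
print(rule_support, rule_confidence, rule_lift)  # 0.375 0.75 1.5
```

A rule can have high confidence and still have lift close to 1 when the consequent is bought by almost everyone, which is why association alone does not establish correlation, let alone causality.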
◼ How to mine such patterns and rules efficiently in large datasets?
◼ How to use such patterns for classification, clustering, and other applications?
Data Mining Function: (3) Classification
◼ Typical methods:
◼ Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
◼ Typical applications:
◼ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
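As one illustration of the methods listed, the following is a minimal sketch of naïve Bayesian classification on categorical features, applied to a made-up fraud-detection example. The function names (train, predict), the two features, and the six training rows are hypothetical; add-one (Laplace) smoothing is an assumption of this sketch so that unseen feature values do not produce zero probabilities.

```python
from collections import Counter, defaultdict
import math

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values = defaultdict(set)            # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values[i].add(value)
    return class_counts, value_counts, values

def predict(model, row):
    """Return the label with the highest log-posterior under the naive assumption."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing over the distinct values of feature i.
            score += math.log((count + 1) / (n + len(values[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (amount level, location) -> label.
rows = [
    ("high", "abroad"), ("high", "abroad"),
    ("low", "home"), ("low", "home"),
    ("high", "home"), ("low", "abroad"),
]
labels = ["fraud", "fraud", "ok", "ok", "ok", "ok"]

model = train(rows, labels)
print(predict(model, ("high", "abroad")))  # fraud
print(predict(model, ("low", "home")))     # ok
```

The model is built from labeled examples and then applied to unseen rows, which is the same two-step pattern the other listed methods follow.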
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general behavior of the data
◼ Noise or exception? One person’s garbage could be another person’s treasure
◼ Methods: by-product of clustering or regression analysis, …
◼ Useful in fraud detection, rare events analysis
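The idea of finding outliers as a by-product of regression analysis can be sketched as follows: fit a least-squares line, then flag the points whose residuals are unusually large. The function names, the toy data, and the cut-off of two standard deviations are assumptions of this sketch, not a recommended setting.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residual_outliers(xs, ys, threshold=2.0):
    """Indices of points whose residual exceeds threshold * residual spread."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Hypothetical data: y = 2x everywhere except at x = 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2, 4, 6, 8, 10, 12, 14, 40, 18, 20]

outlier_indices = residual_outliers(xs, ys)
print(outlier_indices)  # [7]
```

Whether the flagged point is noise to discard or an exception worth investigating is a question the method cannot answer by itself.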
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
◼ Sequence, trend and evolution analysis
◼ Trend, time-series, and deviation analysis: e.g., regression
cards
◼ Periodicity analysis
◼ Similarity-based analysis
Structure and Network Analysis
◼ Graph mining
◼ Finding frequent subgraphs (e.g., chemical
relationships (edges)
◼ e.g., author networks in CS, terrorist networks
◼ Web mining
◼ The Web is a big information network: from PageRank to Google
◼ Analysis of Web information networks
usage mining, …
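Since PageRank is mentioned as the starting point for analysing the Web as an information network, here is a minimal power-iteration sketch on a four-page toy graph. The damping factor of 0.85, the fixed iteration count, the handling of pages without outlinks, and the graph itself are assumptions of this sketch; it is not a description of any production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph page C ends up with the highest rank because three of the four pages link to it, while D, which nothing links to, keeps only the base share.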
Evaluation of Knowledge
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
◼ Tremendous amount of data
◼ Algorithms must be highly scalable to handle terabytes of data
◼ High dimensionality of data
◼ A microarray may have tens of thousands of dimensions
◼ High complexity of data
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data
◼ Structure data, graphs, social networks and multi-linked data
◼ Heterogeneous databases and legacy databases
◼ Spatial, spatiotemporal, multimedia, text and Web data
◼ Software programs, scientific simulations
◼ New and sophisticated applications
Applications of Data Mining
Major Issues in Data Mining (1)
◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Summary