Data Mining: Concepts and Techniques
Introduction
Motivation: Why data mining?
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, …
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What Is Data Mining?
Why Data Mining?—Potential Applications
Other Applications
Text mining (news group, email, documents) and Web
mining
Stream data mining
Bioinformatics and bio-data analysis
Market Analysis and Management
Cross-market analysis
Associations/correlations between product sales, and
prediction based on such associations
Customer profiling
What types of customers buy what products
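As a rough illustration of customer profiling, the sketch below counts which products each customer segment buys. The purchase records, the segment names, and the helper `profile_customers` are hypothetical, invented only for this example.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (customer segment, product).
purchases = [
    ("student", "laptop"),
    ("student", "coffee"),
    ("retiree", "coffee"),
    ("student", "laptop"),
    ("retiree", "garden tools"),
]

def profile_customers(records):
    """Count purchases per product within each customer segment."""
    profile = defaultdict(Counter)
    for segment, product in records:
        profile[segment][product] += 1
    return profile

profile = profile_customers(purchases)
for segment, counts in sorted(profile.items()):
    # List products from most to least frequently bought in this segment.
    summary = ", ".join(f"{product} x{n}" for product, n in counts.most_common())
    print(f"{segment}: {summary}")
```

A real profile would typically use richer attributes (age, income, region) rather than a single segment label.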
Fraud Detection & Mining Unusual Patterns
Data Mining: A KDD Process
[Figure: the KDD process. Surviving labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
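The KDD steps named in this deck (data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, knowledge presentation) can be sketched as a chain of small functions. This is only a toy sketch under stated assumptions: the two data sources, the field names, and every function below are hypothetical, and the "mining" step is a plain frequency count standing in for a real algorithm.

```python
from collections import Counter

# Two hypothetical sources; None marks a missing value.
store_a = [
    {"customer": "c1", "item": "Diaper"},
    {"customer": "c2", "item": None},
]
store_b = [
    {"customer": "c3", "item": "beer"},
    {"customer": "c1", "item": "Beer "},
    {"customer": "c4", "item": "diaper"},
    {"customer": "c5", "item": "milk"},
]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, field):
    """Data selection: keep only the task-relevant field."""
    return [r[field] for r in records]

def transform(values):
    """Transformation: normalize values to a common form."""
    return [v.strip().lower() for v in values]

def mine(values):
    """Data mining: here, simply count how often each value occurs."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a frequency threshold."""
    return {p: n for p, n in patterns.items() if n >= min_count}

records = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(transform(select(records, "item"))), min_count=2)

# Knowledge presentation: report the surviving patterns.
for item, count in sorted(patterns.items()):
    print(f"{item}: {count}")
```

In practice the steps are iterative rather than strictly linear; the straight-line chain here is a simplification.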
Steps of a KDD Process
Architecture: Typical Data Mining System
[Figure: architecture of a typical data mining system. Surviving labels: Pattern evaluation, Databases, Data Warehouse]
Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics
Association (correlation and causality)
Diaper → Beer [support 0.5%, confidence 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish classes
or concepts for future prediction
Presentation: decision-tree, classification rule, neural network
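The two numbers attached to the diaper and beer association rule are conventionally read as support and confidence. The sketch below shows how those two measures are computed for a rule over a set of transactions. The transactions are hypothetical toy data, so the resulting values differ from the 0.5% and 75% quoted on the slide, and the function names are illustrative.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"diaper", "milk"},
    {"bread", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs at least once; otherwise the
    division below raises ZeroDivisionError.
    """
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

s = support({"diaper", "beer"}, transactions)
c = confidence({"diaper"}, {"beer"}, transactions)
print(f"diaper -> beer [support {s:.1%}, confidence {c:.1%}]")
```

Here 3 of 8 transactions contain both items (support 37.5%), and 3 of the 5 transactions containing diapers also contain beer (confidence 60%).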
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., …
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
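One simple way to flag objects that deviate from the bulk of the data is a z-score test. The slides do not prescribe this method; it is one common choice among many. The values, the threshold of 2.0, and the helper `find_outliers` are assumptions made for illustration.

```python
import statistics

# Hypothetical measurements; one value is far from the rest.
values = [10, 12, 11, 13, 12, 11, 10, 12, 95]

def find_outliers(data, threshold=2.0):
    """Return values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    if std == 0:
        return []  # all values identical, so nothing deviates
    return [x for x in data if abs(x - mean) / std > threshold]

outliers = find_outliers(values)
print(outliers)
```

Because the mean and standard deviation are themselves pulled toward extreme values, robust alternatives (for example, measures based on the median) are often preferred on real data.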
Are All the “Discovered” Patterns Interesting?
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database systems, statistics, machine learning, visualization, algorithms, and other disciplines]
Data Mining: Classification Schemes
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, WWW
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
Multiple/integrated functions and mining at multiple
levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, Web mining, etc.
OLAP Mining: Integration of Data Mining and Data Warehousing
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of discovered knowledge with existing
knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of
abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Summary
Data mining: discovering interesting patterns from large amounts of
data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
Where to Find References?
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Data mining and KDD
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGRAPH, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.