Chapter 1 - Tagged
Mining
Chapter 1. Introduction
• Why Data Mining?
• Summary
Data Science and Data-Driven Discovery
Data science is an interdisciplinary field that studies and applies tools and techniques
for deriving useful insights from data. Although data science is regarded as an emerging
field with a distinct identity of its own, the tools and techniques often come from many
different areas of data analysis, such as data mining, statistics, AI, machine learning,
pattern recognition, database
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
What Is Data Mining?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Simple search and query processing
• (Deductive) expert systems
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process stages, from source to result: Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation]
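The stages named on this slide can be sketched as a chain of small functions. This is only an illustrative sketch: the toy records, the function names, and the choice of co-occurring item pairs with a minimum-support threshold as the "mining" and "evaluation" steps are all assumptions, not part of the slides.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical "databases": customer profiles and purchase transactions.
customers = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "south"},
    {"id": 3, "region": None},  # dirty record: missing region
]
transactions = [
    {"customer_id": 1, "items": ["bread", "milk"]},
    {"customer_id": 1, "items": ["bread", "milk", "eggs"]},
    {"customer_id": 2, "items": ["bread", "eggs"]},
    {"customer_id": 3, "items": ["milk", "eggs"]},
]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(customer_rows, transaction_rows):
    """Data integration: join transactions to customers on the customer id."""
    by_id = {c["id"]: c for c in customer_rows}
    return [
        {**by_id[t["customer_id"]], "items": t["items"]}
        for t in transaction_rows
        if t["customer_id"] in by_id
    ]

def select(records):
    """Selection: keep only the task-relevant attribute (the item sets)."""
    return [frozenset(r["items"]) for r in records]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }

baskets = select(integrate(clean(customers), transactions))
patterns = evaluate(mine(baskets), len(baskets))
print(patterns)
```

Each function maps to one stage in the figure, so swapping in a different mining step leaves the rest of the chain unchanged.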
1.2 Motivating Challenges
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
Example: A Web Mining Framework
Data Mining in Business Intelligence
[Figure: layers ordered from highest to lowest potential to support business decisions]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
KDD Process: A Typical View from ML and Statistics
Example: Medical Data Mining
• Health care & medical data mining often adopts such a view from statistics and
machine learning
• Preprocessing of the data (including feature extraction and dimension
reduction)
• Classification and/or clustering processes
• Post-processing for presentation
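The three steps above can be sketched end to end in plain Python. This is a minimal sketch under stated assumptions: the records are made-up toy values rather than real medical data, dropping zero-variance columns stands in for dimension reduction, and a nearest-centroid rule stands in for whatever classifier a real study would use.

```python
from statistics import mean, pstdev

# Hypothetical toy records: (features, label). The third feature is constant.
records = [
    ([5.1, 120.0, 1.0], "healthy"),
    ([4.9, 118.0, 1.0], "healthy"),
    ([7.8, 160.0, 1.0], "at_risk"),
    ([8.1, 165.0, 1.0], "at_risk"),
]

def make_preprocessor(rows):
    """Preprocessing: drop zero-variance columns, then z-score the rest."""
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns) if pstdev(col) > 0]
    stats = [(mean(columns[i]), pstdev(columns[i])) for i in kept]

    def transform(row):
        return [(row[i] - m) / s for i, (m, s) in zip(kept, stats)]

    return transform

def fit_centroids(rows, labels):
    """Classification (training): compute one mean vector per class."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: [mean(col) for col in zip(*members)]
        for label, members in groups.items()
    }

def predict(centroids, row):
    """Classification (prediction): pick the class with the nearest centroid."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], row))

features = [f for f, _ in records]
labels = [label for _, label in records]
transform = make_preprocessor(features)
centroids = fit_centroids([transform(f) for f in features], labels)

new_record = [7.5, 158.0, 1.0]
prediction = predict(centroids, transform(new_record))

# Post-processing: turn the raw prediction into a presentable line.
report = f"Predicted group: {prediction}"
print(report)
```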
Multi-Dimensional View of Data Mining
• Data to be mined
• Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and web, multi-media, graphs & social and information
networks
• Knowledge to be mined (or: Data mining functions)
• Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
• Descriptive vs. predictive data mining
• Multiple/integrated functions and mining at multiple levels
• Techniques utilized
• Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
• Applications adapted
• Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Data Mining Function: (1) Generalization
Data Mining Function: (5) Outlier Analysis
• Outlier analysis
• Outlier: A data object that does not comply with the general behavior of
the data
• Noise or exception? One person’s garbage could be another person’s
treasure
• Methods: by-product of clustering or regression analysis, …
• Useful in fraud detection, rare events analysis
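As one illustration of outliers arising as a by-product of regression analysis, the sketch below fits a least-squares line and flags points whose residual is unusually large. The function name, the toy data, and the threshold of 2 residual standard deviations are assumptions chosen for the example, not values from the slides.

```python
from statistics import mean, pstdev

def regression_outliers(xs, ys, threshold=2.0):
    """Return indices of points whose residual from a least-squares line
    exceeds `threshold` standard deviations of all residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []  # every point lies exactly on the fitted line
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]

# Toy data: points on y = 2x + 1, with one value replaced by an exception.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 40

outliers = regression_outliers(xs, ys)
print(outliers)
```

Whether the flagged point is noise to discard or an exception worth investigating (for example, a fraudulent transaction) is a judgment the analyst still has to make.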
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Structure and Network Analysis
• Graph mining
• Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
• Information network analysis
• Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
• Multiple heterogeneous networks
• A person could be in multiple information networks: friends, family,
classmates, …
• Links carry a lot of semantic information: Link mining
• Web mining
• Web is a big information network: from PageRank to Google
• Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
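To make the link-analysis idea concrete, here is a minimal sketch of PageRank by power iteration on a tiny hypothetical link graph. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions for illustration; this is a textbook formulation, not a description of any production search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of out-links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each page passes an equal share of its rank along every out-link.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page web: nodes are pages, edges are hyperlinks.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

Page C collects links from three other pages and ends up with the highest score, while D, which nothing links to, keeps only the baseline share.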
Evaluation of Knowledge
• Is all mined knowledge interesting?
• One can mine a tremendous amount of “patterns” and knowledge
• Some may fit only a certain dimension space (time, location, …)
• Some may not be representative, may be transient, …
Data Mining: Confluence of Multiple Disciplines
Why Confluence of Multiple Disciplines?
Applications of Data Mining
• Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray
data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)
• Mining Methodology
• Mining various and new kinds of knowledge
• Mining knowledge in multi-dimensional space
• Data mining: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
• Interactive mining
• Incorporation of background knowledge
• Presentation and visualization of data mining results
Major Issues in Data Mining (2)
A Brief History of Data Mining Society
Conferences and Journals on Data Mining
• Web and IR
• Conferences: SIGIR, WWW, CIKM, etc.
• Journals: WWW: Internet and Web Information Systems,
• Statistics
• Conferences: Joint Stat. Meeting, etc.
• Journals: Annals of Statistics, etc.
• Visualization
• Conference proceedings: CHI, ACM SIGGRAPH, etc.
• Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Summary
For each of the following data sets, explain whether or not data privacy is an important issue.
(b) IP addresses and visit times of web users who visit your website.