DM 01 Introduction ML Data Mining
An Introduction
Gregory Piatetsky-Shapiro
KDnuggets
Course Outline
Machine Learning
input, representation, decision trees
Weka
machine learning workbench
Data Mining
associations, deviation detection, clustering, visualization
Case Studies
targeted marketing, genomic microarrays
Data Mining, Privacy and Security
Final Project: Microarray Data Mining Competition
Lesson Outline
Trends leading to Data Flood
More data is generated:
Bank, telecom, other business transactions ...
Scientific data: astronomy, biology, etc.
Web, text, and e-commerce
Big Data Examples
Largest databases in 2003
Commercial databases:
Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T, ~26 TB
Web
Alexa internet archive: 7 years of data, 500 TB
Google searches 4+ billion pages, many hundreds of TB
IBM WebFountain, 160 TB (2003)
Internet Archive (www.archive.org), ~300 TB
From terabytes to exabytes to …
Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo, ~100 TB (Largest Data Warehouse)
3. AT&T, ~94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
Data Growth
Data Growth Rate
Lesson Outline
Machine Learning / Data Mining
Application areas
Science
astronomy, bioinformatics, drug discovery, …
Business
CRM (Customer Relationship Management), fraud detection, e-commerce, manufacturing, sports/entertainment, telecom, targeted marketing, health care, …
Web:
search engines, advertising, web and text mining, …
Government
surveillance (?!), crime detection, profiling tax cheaters, …
Application Areas
Data Mining for Customer Modeling
Customer Tasks:
attrition prediction
targeted marketing:
cross-sell, customer acquisition
credit-risk
fraud detection
Industries
banking, telecom, retail sales, …
Customer Attrition: Case Study
Task:
Predict who is likely to attrite next month.
Estimate customer value and determine the most cost-effective offer to make to this customer.
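One way the two parts of this task could fit together is an expected-value check: make a retention offer only when the value it is expected to save exceeds its cost. The sketch below is illustrative only. The formula, the function names, the assumed offer success rate `p_offer_works`, and all customer numbers are hypothetical and do not come from the case study.

```python
def expected_gain(p_attrite, customer_value, offer_cost, p_offer_works):
    """Expected net gain of making a retention offer (illustrative formula).

    Assumes the offer only matters for customers who would otherwise leave,
    and that it retains such a customer with probability p_offer_works.
    """
    saved_value = p_attrite * p_offer_works * customer_value
    return saved_value - offer_cost


def should_offer(p_attrite, customer_value, offer_cost, p_offer_works=0.3):
    # Make the offer only when the expected saved value exceeds its cost.
    return expected_gain(p_attrite, customer_value, offer_cost, p_offer_works) > 0


# Hypothetical customers: (id, predicted attrition probability, estimated value)
customers = [("a", 0.8, 500.0), ("b", 0.1, 500.0), ("c", 0.8, 40.0)]
targets = [cid for cid, p, value in customers if should_offer(p, value, offer_cost=20.0)]
print(targets)
```

Under these made-up numbers only the customer who is both likely to leave and valuable is targeted; a loyal customer or a low-value one does not justify the offer cost.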
Customer Attrition Results
Assessing Credit Risk: Case Study
Credit Risk - Results
e-commerce
Successful e-commerce – Case Study
Task: Recommend other books (products) this person is likely to buy
Amazon does clustering based on books bought:
customers who bought “Advances in Knowledge Discovery and Data Mining” also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”
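The "customers who bought X also bought Y" idea can be approximated with plain co-purchase counting. The sketch below is a simplification and is not Amazon's actual method (the slide mentions clustering; this is only co-occurrence counting). The function names and the toy purchase histories are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def co_purchase_counts(baskets):
    """For each item, count how often every other item was bought by the same customer."""
    counts = defaultdict(Counter)
    for items in baskets:
        # Each ordered pair (a, b) of distinct items in one basket adds one vote for b given a.
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def also_bought(counts, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in counts[item].most_common(k)]


# Toy purchase histories, one list per customer (made-up data).
baskets = [
    ["kdd_book", "weka_book"],
    ["kdd_book", "weka_book", "stats_book"],
    ["kdd_book", "cookbook"],
    ["cookbook", "novel"],
]
counts = co_purchase_counts(baskets)
print(also_bought(counts, "kdd_book", k=1))
```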
Unsuccessful e-commerce case study (KDD-Cup 2000)
Data: clickstream and purchase data from Gazelle.com, a legwear and legcare e-tailer
Q: Characterize visitors who spend more than $12 on an average order at the site
Dataset of 3,465 purchases, 1,831 customers
Very interesting analysis by Cup participants:
thousands of hours, equivalent to $X,000,000 (millions) of consulting
Genomic Microarrays – Case Study
Example: ALL/AML data
38 training cases, 34 test cases, ~7,000 genes
2 classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
Use the training data to build a diagnostic model
[Figure: two panels, labeled ALL and AML]
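The train-then-test workflow for a two-class diagnostic model can be sketched with a nearest-centroid classifier. This is only one simple possibility, not the method used in the ALL/AML study. The data below is synthetic (random numbers standing in for expression levels), it uses 50 features rather than ~7,000 genes, and the 19/19 and 17/17 class splits are arbitrary choices that merely match the 38/34 case counts.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit(X, y):
    """Compute one mean profile (centroid) per class from the training data."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in by_class.items()}


def predict(model, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda label: sq_dist(model[label]))


def make_samples(rng, n, label, shift, n_genes=50):
    # Synthetic stand-in for expression data: classes differ only in their mean.
    return [([rng.gauss(shift, 1.0) for _ in range(n_genes)], label) for _ in range(n)]


rng = random.Random(0)
train = make_samples(rng, 19, "ALL", 0.0) + make_samples(rng, 19, "AML", 1.0)
test = make_samples(rng, 17, "ALL", 0.0) + make_samples(rng, 17, "AML", 1.0)

model = fit([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The key point is that the model is built from the training cases only and scored on the held-out test cases.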
Securities Fraud
NASDAQ KDD system
Phone fraud
AT&T, Bell Atlantic, British Telecom/MCI
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Related Fields
[Diagram: Data Mining and Knowledge Discovery at the center, with related fields Machine Learning, Visualization, Statistics, and Databases]
Statistics, Machine Learning and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics – areas not part of data mining
Data Mining and Knowledge Discovery
integrates theory and heuristics
focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
witten&eibe
Knowledge Discovery Process flow, according to CRISP-DM
[Figure: process flow diagram; only the label "Monitoring" survives]
see www.crisp-dm.org for more information
Historical Note: Many Names of Data Mining
Data Fishing, Data Dredging: 1960-
used by statisticians (as a pejorative)
Data Mining: 1990-
used by the database and business communities
in 2003, acquired a bad image because of TIA (Total Information Awareness)
Knowledge Discovery in Databases: 1989-
used by the AI and machine learning communities
also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
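To make the Associations task concrete ("A & B & C occur frequently"), the sketch below counts how often each itemset of a given size appears across transactions and keeps those above a minimum support. It is a brute-force illustration, not an efficient algorithm such as Apriori. The function name, the transactions, and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of `size` items whose support is at least `min_support`.

    Support is the fraction of transactions that contain the whole itemset.
    """
    counts = Counter()
    for t in transactions:
        # Sort so that each itemset has one canonical tuple form.
        for combo in combinations(sorted(set(t)), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


# Made-up transactions over items A-D.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "B", "C"},
]
frequent = frequent_itemsets(transactions, size=3, min_support=0.5)
print(frequent)
```

Here A, B and C occur together in 3 of the 5 transactions, so that itemset is the only one of size 3 that clears the threshold.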
Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
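As a minimal illustration of learning from pre-labeled instances, the sketch below fits a decision stump, that is, a decision tree with a single split. It stands in for the decision-tree family only in the simplest possible way. The feature meanings (age, income), the labels, and the data are invented for the example.

```python
def fit_stump(X, y):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must put instances on both sides
            # Predict the majority class on each side of the split.
            l_label = max(set(left), key=left.count)
            r_label = max(set(right), key=right.count)
            errors = sum(v != l_label for v in left) + sum(v != r_label for v in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_label, r_label)
    if best is None:
        raise ValueError("no valid split found")
    _, j, t, l_label, r_label = best
    return j, t, l_label, r_label


def predict_stump(stump, row):
    j, t, l_label, r_label = stump
    return l_label if row[j] <= t else r_label


# Toy pre-labeled instances: [age, income] -> responds to offer? (made-up data)
X = [[25, 20], [30, 60], [45, 80], [50, 30], [35, 90], [60, 25]]
y = ["no", "yes", "yes", "no", "yes", "no"]

stump = fit_stump(X, y)
print(stump)
print(predict_stump(stump, [40, 70]))
```

On this toy data the stump splits on the second feature (income) at 30, which separates the two classes without training errors.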
Data Mining Tasks: Clustering
Find “natural” groupings of instances given unlabeled data
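One common way to look for such groupings is k-means, sketched below in plain Python. It is an illustration under simplifying assumptions: the number of clusters `k` is given in advance, the initial centers are simply the first `k` points (real implementations usually pick them more carefully), and the 2-D points are made up.

```python
def kmeans(points, k, iters=20):
    """Basic k-means: alternate between assigning points and recomputing centers."""
    centers = list(points[:k])  # simplistic initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled toy data with two visually obvious groups (made-up points).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))
```

No labels are used at any point; the grouping comes only from distances between instances.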
Summary:
More on Data Mining and Knowledge Discovery
KDnuggets.com
News, Publications
Software, Solutions
Courses, Meetings, Education
Publications, Websites, Datasets
Companies, Jobs
…
Data Mining Jobs in KDnuggets
[Figure: chart "KDnuggets Job Ads", job ads per year for 1997-2005, series: Industry and Academic, y-axis 0 to 180]