
Data Mining Tutorial
Gregory Piatetsky-Shapiro
KDnuggets

© 2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples

Trends leading to Data Flood
- More data is generated:
  - Web, text, images …
  - Business transactions, calls, ...
  - Scientific data: astronomy, biology, etc.
- More data is captured:
  - Storage technology is faster and cheaper
  - DBMS can handle bigger databases

Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo, ~100 TB (Largest Data Warehouse)
3. AT&T, ~94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp

Data Growth

In 2 years (2003 to 2005), the size of the largest database TRIPLED!

Data Growth Rate

- Twice as much information was created in 2002 as in 1999 (~30% annual growth rate)
- Other growth rate estimates are even higher
- Very little data will ever be looked at by a human

Knowledge Discovery is NEEDED to make sense and use of data.
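The growth figures quoted on these slides can be turned into yearly rates with the compound-growth formula, rate = factor^(1/years) - 1. The sketch below is only an illustration; the helper name `annual_growth_rate` is made up here and is not part of the tutorial.

```python
def annual_growth_rate(factor: float, years: float) -> float:
    """Constant yearly growth rate that compounds to `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0

# Information created doubled between 1999 and 2002 (3 years).
doubling = annual_growth_rate(2.0, 3)
# The largest database tripled between 2003 and 2005 (2 years).
tripling = annual_growth_rate(3.0, 2)

print(f"doubling in 3 years: {doubling:.1%} per year")
print(f"tripling in 2 years: {tripling:.1%} per year")
```

Under this compounding assumption, doubling over three years works out to roughly a quarter per year, which is in the same range as the ~30% quoted on the slide, while tripling in two years implies a much faster yearly rate.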

Knowledge Discovery Definition
Knowledge Discovery in Data is the non-trivial process of identifying
- valid
- novel
- potentially useful
- and ultimately understandable
patterns in data.

From Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996

Related Fields

[Diagram: Data Mining and Knowledge Discovery at the center, overlapping with Machine Learning, Visualization, Statistics, and Databases]

Statistics, Machine Learning and Data Mining
- Statistics:
  - more theory-based
  - more focused on testing hypotheses
- Machine learning:
  - more heuristic
  - focused on improving performance of a learning agent
  - also looks at real-time learning and robotics, areas not part of data mining
- Data Mining and Knowledge Discovery:
  - integrates theory and heuristics
  - focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
- Distinctions are fuzzy

Knowledge Discovery Process flow, according to CRISP-DM

[Diagram: CRISP-DM process flow, with an added Monitoring step]

See www.crisp-dm.org for more information.

Continuous monitoring and improvement is an addition to CRISP.

Historical Note: Many Names of Data Mining
- Data Fishing, Data Dredging: 1960-
  - used by statisticians (as a bad name)
- Data Mining: 1990-
  - used in DB community, business
- Knowledge Discovery in Databases: 1989-
  - used by AI, Machine Learning community
- also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently: Data Mining and Knowledge Discovery are used interchangeably

Data Mining Tasks

Some Definitions

- Instance (also Item or Record):
  - an example, described by a number of attributes
  - e.g. a day can be described by temperature, humidity and cloud status
- Attribute or Field:
  - measures an aspect of the Instance, e.g. temperature
- Class (Label):
  - a grouping of instances, e.g. days good for playing
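As a rough illustration of these three terms, the sketch below models the day example as a small Python data structure. The class name `Day`, the field names, the units, and all values are assumptions invented for this example; they do not come from the tutorial.

```python
from dataclasses import dataclass

@dataclass
class Day:
    """One instance (record): a day described by attributes plus a class label."""
    temperature: float      # attribute (assumed unit: degrees Fahrenheit)
    humidity: float         # attribute (assumed unit: percent)
    cloud_status: str       # attribute, e.g. "sunny", "overcast", "rainy"
    good_for_playing: bool  # class label

# Three made-up instances.
days = [
    Day(85, 85, "sunny", False),
    Day(83, 86, "overcast", True),
    Day(65, 70, "rainy", False),
]

# The class labels are what a classifier would later try to predict.
labels = [d.good_for_playing for d in days]
print(labels)
```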

Major Data Mining Tasks
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Deviation Detection: finding changes
- Estimation: predicting a continuous value
- Link Analysis: finding relationships
- …

Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...
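To make "learning from pre-labeled instances" concrete, here is a minimal sketch of a decision stump, that is, a decision tree with a single split. It is only one simple instance of the decision-tree family, not the tutorial's own method; the humidity values, labels, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical pre-labeled instances: (humidity in percent, class label).
training = [(65, "play"), (70, "play"), (75, "play"), (80, "play"),
            (85, "no"), (90, "no"), (95, "no"), (70, "no")]

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def learn_stump(data):
    """Pick the threshold whose two sides, each predicting its
    majority class, make the fewest errors on the training data."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        low = [y for x, y in data if x <= threshold]
        high = [y for x, y in data if x > threshold]
        if not high:
            continue  # a split must leave instances on both sides
        low_label, high_label = majority(low), majority(high)
        errors = (sum(y != low_label for y in low)
                  + sum(y != high_label for y in high))
        if best is None or errors < best[0]:
            best = (errors, threshold, low_label, high_label)
    return best[1:]

def predict(stump, humidity):
    """Classify a new instance with the learned one-split rule."""
    threshold, low_label, high_label = stump
    return low_label if humidity <= threshold else high_label

stump = learn_stump(training)
print(stump, predict(stump, 72), predict(stump, 92))
```

The learned rule misclassifies one training instance (the day with humidity 70 labeled "no"), which is expected: a useful classifier does not need to fit noisy training data perfectly.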

Clustering
Find “natural” grouping of instances given un-labeled data
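The slide does not name a specific algorithm. As one common choice, the sketch below runs plain k-means on a handful of made-up 2-D points; the data, the starting centroids, and the function name `kmeans` are assumptions for illustration only.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Un-labeled points that visibly fall into two groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No class labels are supplied anywhere: the two groups emerge from the distances alone, which is the difference from classification.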

Association Rules & Frequent Itemsets

Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

Frequent Itemsets:
- Milk, Bread (4)
- Bread, Cereal (4)
- Milk, Bread, Cereal (2)
- …

Rules:
- Milk => Bread (confidence 4/6 ≈ 67%)
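The counts on this slide can be reproduced by brute force. The sketch below encodes the nine transactions from the table and computes support (how many transactions contain an itemset) and rule confidence; the function names `support` and `confidence` are chosen here for illustration, and a real miner would use a smarter search such as Apriori rather than enumerating every pair.

```python
from itertools import combinations

# The nine transactions from the slide, keyed implicitly by TID 1..9.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def confidence(lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(set(lhs) | set(rhs)) / support(lhs)

# Count every pair of items; brute force is fine at this size.
items = sorted(set().union(*transactions))
pair_counts = {pair: support(pair) for pair in combinations(items, 2)}

print(support({"MILK", "BREAD"}))
print(support({"MILK", "BREAD", "CEREAL"}))
print(f"{confidence({'MILK'}, {'BREAD'}):.1%}")
```

Milk appears in six transactions and four of those also contain bread, which is where the confidence of the rule Milk => Bread comes from.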

Visualization & Data Mining
- Visualizing the data to facilitate human discovery
- Presenting the discovered results in a visually "nice" way

Summarization

- Describe features of the selected group
- Use natural language and graphics
- Usually in combination with Deviation Detection or other methods

Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."

Data Mining Central Quest

Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities)
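A small simulation can show why searching too many possibilities produces patterns that look significant but are random. In the sketch below both the class labels and the candidate features are coin flips, so nothing is truly predictive; the sample sizes, the seed, and all names are arbitrary choices made for this illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_train, n_features = 20, 1000
# Class labels are coin flips, so no feature can truly predict them.
train_labels = [rng.random() < 0.5 for _ in range(n_train)]

def accuracy(feature, labels):
    """Share of instances where the binary feature equals the label."""
    return sum(f == y for f, y in zip(feature, labels)) / len(labels)

# Search many random binary features and keep the best training score.
best_train = max(
    accuracy([rng.random() < 0.5 for _ in range(n_train)], train_labels)
    for _ in range(n_features)
)

# The same kind of feature on fresh data: values and labels are new coin flips.
n_test = 1000
test_labels = [rng.random() < 0.5 for _ in range(n_test)]
test_feature = [rng.random() < 0.5 for _ in range(n_test)]
holdout = accuracy(test_feature, test_labels)

print(f"best of {n_features} random features on training data: {best_train:.0%}")
print(f"a random feature on fresh data: {holdout:.0%}")
```

With only 20 training instances and 1000 candidates, the best candidate scores well above chance on the training data purely by luck, while on fresh data such a feature falls back to about 50%. Evaluating on data not used during the search is the usual guard against this.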