Data Mining Methods

Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

Introduction to

Data Mining Methods


Data Mining:
Data Mining Methods
with Dr. Qin Lv
Learning objective: Identify the core
functionalities of data modeling in the
data mining pipeline. Apply the Apriori
algorithm for frequent itemset mining.
Data Mining: Four Views
Application

Knowledge Technique

Data
Application
Data Mining Pipeline
Knowledge
Pattern
evaluation

Data modeling

Data
warehousing
Technique
Data
preprocessing

Data
understanding

Data
Technique View
Ø Frequent pattern analysis
Ø Classification, prediction
Ø Clustering
Ø Anomaly detection
Ø Trend and evolution analysis
Frequent Pattern
Analysis
Ø Frequent itemset
Ø Frequent sequence
Ø Frequent structure
Ø Association rules
Ø Correlation analysis
Classification
Ø Pre-defined
classes
Ø Need training data
Ø Build model to
distinguish classes
Prediction
Ø Numerical prediction
(continuous value)
• E.g., weather
• E.g., stock price
• E.g., traffic
Clustering
Ø No predefined
classes
Ø Intra-cluster
similarity
Ø Inter-cluster
dissimilarity
Anomaly Detection
280

Ø Anomaly/outlier 260 Unusual Time


Series Snippets Level Shifting

• Differ from the “norm” 240

Kelvin
• E.g., error, noise 220

• E.g., fraud 200

• E.g., extreme events 180

160
199 200 200 200 200 200
8-0 0-0 1-0 2-1 4-0 5-0
9-1 1-3 6-1 0-2 3-1 7-2
8 1 5 8 1 4
Date (yyyy-mm-dd)
Trend and Evolution Analysis
Ø Changes over time
• Overall trend
• Periodical patterns
• Anomalies
• E.g.,
Data Mining Methods
Ø Frequent pattern analysis
Ø Classification
Ø Clustering
Ø Outlier analysis
Market Basket Analysis
Tid Items
Ø List of transactions
1 A, B, C, E
• Each Ti contains multiple items
2 A, D, E
Ø (Frequent) itemset
• X = {x1, x2, …, xk} 3 B, C, E

Ø (Minimum) support 4 B, C, D, E
• Probability of Ti containing X 5 B, D, E
Frequent Pattern Mining
Ø Brute force approach (e.g., 100 items)

Ø Closed pattern X: no super-pattern Y ⊃ X


w/ the same support
Ø Max-pattern X: no super-pattern Y ⊃ X
Closed & Max Pattern Example
Ø {<a1, ..., a100>, <a1, ..., a50>} min_sup = 0.5
Ø Frequent pattern? all item combinations
Ø Closed pattern?
• <a1, ..., a100>: 1; <a1, ..., a50>: 2
Ø Max-pattern?
• <a1, ..., a100>: 1
Apriori Algorithm
Ø Apriori pruning: if X is infrequent, then any
of its superset cannot be frequent
Ø Procedure
• Scan dataset to get freq. 1-itemsets
• Generate candidate (k+1)-itemsets from freq. k-itemsets
• Scan dataset to remove infreq. candidate (k+1)-itemsets
• Stop when no more freq. or candidate itemsets
Itemset #
Apriori Algorithm Example {B, C} 3
What about {B, D, E} {B, D} 2
Ø min_sup = 0.6 or {C, D, E}?
{B, E} 4
Tid Items Itemset #
{C, D} 1
1 A, B, C, E {A} 2
{C, E} 3
2 A, D, E {B} 4
{D, E} 3
3 B, C, E {C} 3
4 B, C, D, E {D} 3 Itemset #
5 B, D, E {E} 5 {B, C, E} 3
Important Details
Ø Self-joining of k-itemsets => (k+1)-itemsets
• Only join if their first (k-1) items are the same
Ø Pruning: remove if subset is not frequent
Ø Example: L3 = {abc, abd, acd, ace, bcd}
• abc and abd => abcd and bcd is in L3 => valid candidate
• acd and ace => acde but ade is not in L3 => pruned

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy