Data Mining: Concepts and Techniques
— Chapter 5 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent. If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Scalable mining methods, three major approaches:
Apriori (Agrawal & Srikant @VLDB'94)
Frequent pattern growth (FPgrowth, Han, Pei & Yin @SIGMOD'00)
Vertical data format approach (Charm, Zaki & Hsiao @SDM'02)
Apriori: A Candidate Generation-and-Test Approach
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
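For readers who prefer running code, here is a minimal Python sketch of this loop. The function name apriori and the use of an absolute min_sup count are assumptions of this sketch, and the candidate-pruning step described on the next slide is omitted for brevity.

```python
from itertools import chain

def apriori(transactions, min_sup):
    """transactions: list of item sets; min_sup: absolute support count."""
    # L1 = {frequent items}
    items = set(chain.from_iterable(transactions))
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}
    all_frequent = set(Lk)
    k = 1
    while Lk:
        # Ck+1 = candidates generated from Lk (unordered self-join; no pruning here)
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # count the candidates contained in each transaction
        counts = {c: sum(c <= t for t in transactions) for c in Ck1}
        # Lk+1 = candidates in Ck+1 with min_support
        Lk = {c for c, n in counts.items() if n >= min_sup}
        all_frequent |= Lk
        k += 1
    return all_frequent
```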
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
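The two steps can be written compactly as below; the helper name apriori_gen and the sorted-tuple representation are choices of this sketch, not notation from the slides.

```python
from itertools import combinations

def apriori_gen(Lk):
    """Lk: set of frequent k-itemsets as sorted tuples; returns C(k+1)."""
    k = len(next(iter(Lk)))
    ordered = sorted(Lk)
    joined = set()
    # Step 1: self-join Lk - join itemsets that agree on their first k-1 items
    for a in ordered:
        for b in ordered:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                joined.add(a + (b[-1],))
    # Step 2: pruning - every k-subset of a candidate must itself be in Lk
    return {c for c in joined
            if all(s in Lk for s in combinations(c, k))}

# The slide's example: L3 = {abc, abd, acd, ace, bcd}  ->  C4 = {abcd}
L3 = {('a','b','c'), ('a','b','d'), ('a','c','d'), ('a','c','e'), ('b','c','d')}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned since ade is not in L3
```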
How to Count Supports of Candidates?
Candidate itemsets are stored in a hash tree; the subset function finds all candidates contained in a transaction
[Figure: hash tree of candidate 3-itemsets with hash buckets 1,4,7 / 2,5,8 / 3,6,9; the transaction 1 2 3 5 6 is traced through the tree to locate the candidate subsets it contains]
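A simplified counting sketch: a plain Python set stands in for the hash tree, so only the subset-matching idea is shown, not the hash-tree traversal from the figure; the candidate 3-itemsets below are illustrative.

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """candidates: set of k-itemsets as sorted tuples; returns support counts."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # enumerate the k-subsets of the transaction and match them against candidates
        for subset in combinations(sorted(t), k):
            if subset in counts:
                counts[subset] += 1
    return counts

# The transaction 1 2 3 5 6 from the figure, with three illustrative candidates
print(count_supports([{1, 2, 3, 5, 6}], {(1, 2, 5), (1, 3, 5), (3, 5, 6)}, 3))
```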
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: the itemset lattice from {} up to ABCD, and a timeline comparing Apriori, which finishes counting 1-itemsets before starting 2-itemsets, with DIC, which starts counting longer itemsets earlier within the scans]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly, and candidate generation-and-test can produce a huge number of candidates; can candidate generation be avoided?
Benefits of the FP-tree structure
Completeness
Preserve complete information for frequent pattern mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info: infrequent items are gone
Partition patterns and databases: frequent patterns can be partitioned into subsets according to the f-list, i.e., patterns containing p, patterns having m but no p, …, patterns containing only f
Find patterns having p from the p-conditional database
[Figure: the FP-tree with header table (f:4, c:4, a:3, b:3, m:3, p:3); one branch f:4 → c:3 → a:3 → m:2 → p:2 with side paths a:3 → b:1 → m:1 and f:4 → b:1, and a second branch c:1 → b:1 → p:1]
Conditional pattern bases
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
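The conditional pattern bases in this table can be reproduced directly from the frequency-ordered transactions (fcamp, fcabm, fb, cbp, fcamp); the FP-tree stores exactly these prefix paths, compressed by sharing common prefixes. A small sketch with hypothetical helper names:

```python
from collections import Counter

ordered_db = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]   # items in f-list order f,c,a,b,m,p

def conditional_pattern_base(item, db):
    base = Counter()
    for t in db:
        if item in t:
            prefix = t[:t.index(item)]   # the items before `item` on its path
            if prefix:
                base[prefix] += 1
    return dict(base)

for x in "cabmp":
    print(x, conditional_pattern_base(x, ordered_db))
# c {'f': 3}   a {'fc': 3}   b {'fca': 1, 'f': 1, 'c': 1}
# m {'fca': 2, 'fcab': 1}    p {'fcam': 2, 'cb': 1}
```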
From Conditional Pattern-bases to Conditional FP-trees
[Figure: the m-conditional FP-tree is {} → f:3 → c:3 → a:3; the cond. pattern base of "am", (fc:3), gives the am-conditional FP-tree {} → f:3 → c:3; the cond. pattern base of "cm", (f:3), gives the cm-conditional FP-tree {} → f:3]
Mining Frequent Patterns With FP-trees
Idea: recursively grow frequent patterns by pattern and database partition
Method
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern (see the sketch below)
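A compact sketch of that recursion. To stay short it keeps each (conditional) database as a list of (ordered itemset, count) pairs instead of an actual FP-tree, so it mirrors the divide-and-conquer structure but not the tree compression; the function name and representation are mine.

```python
from collections import Counter

def pattern_growth(db, min_sup, suffix=()):
    """db: list of (item_tuple, count) with items in f-list order.
    Yields (pattern, support) pairs, mirroring the FP-growth recursion."""
    counts = Counter()
    for items, c in db:
        for x in items:
            counts[x] += c
    for x, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = (x,) + suffix                     # grow the current suffix by x
        yield pattern, sup
        # conditional pattern base of x: the prefix of x in each transaction
        cond = [(items[:items.index(x)], c) for items, c in db if x in items]
        cond = [(t, c) for t, c in cond if t]
        yield from pattern_growth(cond, min_sup, pattern)

# The example DB above, each transaction with count 1
db = [(tuple(t), 1) for t in ["fcamp", "fcabm", "fb", "cbp", "fcamp"]]
patterns = dict(pattern_growth(db, min_sup=2))
print(patterns[('c', 'p')], patterns[('f', 'c', 'a', 'm')])   # 3 3
```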
Scaling FP-growth by DB Projection
Parallel projection needs a lot of disk space; partition projection saves it
[Figure: Tran. DB {fcamp, fcabm, fb, cbp, fcamp} projected into per-item databases, which are projected further, e.g. am-proj DB {fc, fc, fc} and cm-proj DB {f, f, f}, …]
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) vs. support threshold (%) for FP-growth and Apriori on the test data set; FP-growth scales better as the threshold decreases]
Why Is FP-Growth the Winner?
Divide-and-conquer:
decompose both the mining task and DB according to
the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic operations are counting local frequent items and building sub FP-trees; no pattern search and matching
Mining multiple-level association rules: flexible support settings (items at lower levels are expected to have lower support) and exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)
Quantitative association rules: numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster adjacent association rules to form general rules using a 2-D grid
Example
age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")
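A minimal sketch of a 2-D quantitative rule under static discretization; the ARCS-style clustering of adjacent rules on the 2-D grid is not shown, and the tiny customer table is invented purely for illustration.

```python
# Hypothetical data: each record has two numeric attributes and one categorical outcome
customers = [
    {"age": 34, "income": 42_000, "buys_hdtv": True},
    {"age": 35, "income": 31_000, "buys_hdtv": True},
    {"age": 35, "income": 48_000, "buys_hdtv": False},
    {"age": 52, "income": 45_000, "buys_hdtv": False},
]

def lhs(c):
    # age(X, "34-35") AND income(X, "30-50K")
    return 34 <= c["age"] <= 35 and 30_000 <= c["income"] < 50_000

matching = [c for c in customers if lhs(c)]
both = [c for c in matching if c["buys_hdtv"]]
print("support =", len(both) / len(customers))    # 0.5
print("confidence =", len(both) / len(matching))  # 0.666...
```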
Constraint types in constraint-based mining
Data constraint, e.g. find product pairs sold together in stores in Chicago in Dec.'02
Dimension/level constraint
Rule (or pattern) constraint: small sales (price < $10) triggers big sales (sum > $200)
Interestingness constraint, e.g. min_confidence ≥ 60%
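A rule (pattern) constraint like the one above is just a predicate over an itemset; a minimal sketch, with hypothetical item prices, applied by post-filtering mined patterns:

```python
price = {"pen": 2, "notebook": 5, "printer": 220, "monitor": 250}   # assumed prices

def small_triggers_big(itemset):
    # "small sales (price < $10) triggers big sales (sum > $200)"
    has_small = any(price[i] < 10 for i in itemset)
    return has_small and sum(price[i] for i in itemset) > 200

patterns = [{"pen", "printer", "notebook"}, {"pen", "notebook"}, {"monitor"}]
print([p for p in patterns if small_triggers_big(p)])   # only the first pattern survives
```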
Constrained Mining vs. Constraint-Based Search
Constrained mining vs. constraint-based search in AI: both aim at reducing the search space; finding all patterns satisfying the constraints vs. finding some (or one) answer satisfying them
Constrained mining vs. query processing in DBMS: database query processing requires finding all answers satisfying the query; constrained pattern mining shares a similar philosophy to pushing selections deep into query processing
Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
Succinctness:
Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
min(S.Price) ≤ v is succinct
sum(S.Price) ≥ v is not succinct
Optimization: If C is succinct, C is pre-counting pushable
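A small sketch of why min(S.Price) ≤ v is succinct: the candidate space can be restricted from the item list alone, with no database scan; prices here are hypothetical.

```python
prices = {"a": 5, "b": 20, "c": 1, "d": 50}   # assumed item prices
v = 10

cheap = {i for i in prices if prices[i] <= v}   # items that can make min(S.Price) <= v

def satisfies_min_constraint(S):
    # decided purely by the selection of items, never by the transaction database
    return bool(S & cheap)

print(satisfies_min_constraint({"b", "c"}))   # True: c costs 1
print(satisfies_min_constraint({"b", "d"}))   # False: no item priced <= 10
# sum(S.Price) >= v has no such item-level characterization, so it is not succinct
```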
The Apriori Algorithm — Example (min_sup = 2)

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 with counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (from L2): {2 3 5}
Scan D → L3: {2 3 5}:2
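Using the hypothetical apriori() sketch from earlier, this example (min_sup = 2) is reproduced as follows:

```python
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_sup=2)              # apriori() as sketched earlier
print(max(frequent, key=len))                 # frozenset({2, 3, 5}), i.e. L3
```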
Naïve Algorithm: Apriori + Constraint
Run the Apriori algorithm on Database D exactly as in the example above; the constraint is checked only against the frequent itemsets produced at the end
Constraint: Sum{S.price} < 5
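A sketch of this naive strategy, reusing the apriori() sketch and Database D from above and assuming, purely for illustration, that each item's price equals its item number:

```python
price = {i: i for i in range(1, 6)}           # assumption: price(i) = i
mined = apriori(D, min_sup=2)                 # mine first ...
result = [S for S in mined if sum(price[i] for i in S) < 5]   # ... constrain afterwards
print(result)    # with these prices, only {1}, {2}, {3} and {1, 3} survive
```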
The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep
Same Database D and Apriori run as above, but any candidate that already violates the anti-monotone constraint is pruned as soon as it is generated, since none of its supersets can satisfy the constraint
Constraint: Sum{S.price} < 5
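With the hypothetical prices from the previous sketch, pushing the anti-monotone constraint deep means discarding a candidate as soon as its price sum reaches 5, since no superset can recover; apriori_gen() is the sketch from earlier.

```python
def prune_by_sum(candidates, limit):
    # anti-monotone: once sum(S.price) >= limit, every superset also violates the constraint
    return {c for c in candidates if sum(price[i] for i in c) < limit}

C2 = apriori_gen({(1,), (2,), (3,), (5,)})    # candidate 2-itemsets from L1
print(prune_by_sum(C2, 5))                    # only (1, 2) and (1, 3) are kept for counting
```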
The Constrained Apriori Algorithm: Push a Succinct Constraint Deep
Same Database D and Apriori run as above; because the constraint is succinct, itemsets that cannot satisfy it are recognized from the items alone and are not immediately used in candidate generation
Constraint: min{S.price} ≤ 1
Converting “Tough” Constraints
Convert tough constraints into anti-monotone or monotone by properly ordering items
Examine C: avg(S.profit) ≥ 25
Order items in value-descending order: <a, f, g, d, b, h, c, e>
If an itemset afb violates C, so does afbh, afb*
It becomes anti-monotone!

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
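A sketch of the conversion: after sorting items by descending profit, avg(S.profit) ≥ 25 behaves anti-monotonically along prefixes of that order, because every further item appended has profit no larger than the current minimum and so can only lower the average.

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
order = sorted(profit, key=profit.get, reverse=True)   # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']

def avg_ok(itemset):
    return sum(profit[i] for i in itemset) / len(itemset) >= 25

print(order)
print(avg_ok({"a", "f"}), avg_ok({"a", "f", "b"}), avg_ok({"a", "f", "b", "h"}))
# True False False: once afb violates C, so do afbh and any other extension afb*
```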
Strongly Convertible Constraints
Constraint                        Antimonotone   Monotone   Succinct
S ⊆ V                             yes            no         yes
min(S) ≤ v                        no             yes        yes
sum(S) ≤ v (a ∈ S, a ≥ 0)         yes            no         no
sum(S) ≥ v (a ∈ S, a ≥ 0)         no             yes        no
range(S) ≤ v                      yes            no         no
range(S) ≥ v                      no             yes        no
support(S) ≤ ξ                    no             yes        no
[Figure: a classification of constraints: antimonotone, monotone, succinct, convertible anti-monotone, convertible monotone, strongly convertible, inconvertible]