Frequent Patterns
diaper}
i.e., every transaction having {beer, diaper, nuts} also
@SIGMOD’00)
Vertical data format approach (Charm: Zaki & Hsiao @SDM’02)
January 17, 2020 Data Mining: Concepts and Techniques 8
Apriori: A Candidate Generation-and-Test Approach
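Subset function
Transaction: 1 2 3 5 6
[Figure: hash tree of candidate 3-itemsets. The subset function hashes items into three branches (1,4,7; 2,5,8; 3,6,9). The transaction is split recursively (1+2356, 12+356, 13+56) and matched against the leaves: 145, 124, 457, 125, 458, 159, 136, 234, 567, 345, 356, 357, 689, 367, 368.]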
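The subset step can be sketched in Python. This is a simplified stand-in rather than the hash tree itself: the candidates sit in a plain dictionary, and every 3-item subset of the transaction is looked up in it. The candidate list is read off the leaves of the figure; the name `count_candidates` is hypothetical.

```python
from itertools import combinations

# Candidate 3-itemsets read off the leaves of the hash tree figure.
CANDIDATES = [
    (1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8), (1, 5, 9),
    (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5), (3, 5, 6), (3, 5, 7),
    (6, 8, 9), (3, 6, 7), (3, 6, 8),
]

def count_candidates(transactions, candidates, k=3):
    """Count how many transactions contain each candidate k-itemset."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerate the sorted k-subsets of the transaction and look each one up.
        for subset in combinations(sorted(set(t)), k):
            if subset in counts:
                counts[subset] += 1
    return counts

counts = count_candidates([[1, 2, 3, 5, 6]], CANDIDATES)
matched = sorted(c for c, n in counts.items() if n > 0)
print(matched)
```

A real hash tree avoids enumerating every subset by descending only into the branches the transaction can reach; the dictionary lookup here trades that saving for brevity.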
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce the number of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
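To make the three challenges concrete, here is a minimal level-wise sketch of candidate generation-and-test in Python. It is an illustration under assumptions, not a tuned implementation: the function name `apriori` and the `scans` counter are hypothetical, and the data is the four-transaction TDB (min_sup=2) used in the constraint examples of this deck.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return ({frequent itemset: support}, number of database scans)."""
    transactions = [frozenset(t) for t in transactions]
    # Scan 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_sup}
    result = dict(frequent)
    scans, k = 1, 2
    while frequent:
        prev = set(frequent)
        # Generate: join frequent (k-1)-itemsets whose union has exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        if not candidates:
            break
        # Test: one full scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        scans += 1
        frequent = {c: n for c, n in counts.items() if n >= min_sup}
        result.update(frequent)
        k += 1
    return result, scans

TDB = ["abcdf", "bcdfgh", "acdef", "cefg"]
frequent, scans = apriori(TDB, 2)
print(len(frequent), "frequent itemsets found in", scans, "scans")
```

Each level costs one pass over the data, and the inner loop tests every candidate against every transaction, which is where the scan count, the candidate count and the support-counting workload all show up.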
Dynamic itemset counting (DIC)
Once both A and D are determined frequent, the counting of AD begins
Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: itemset lattice over items A, B, C, D ({}; A B C D; AB AC BC AD BD CD; ABC ABD ACD BCD; ABCD), with a timeline over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets, 3-itemsets, …]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97
Bottleneck of Frequent-pattern Mining
Completeness
Preserve complete information for frequent pattern
mining
Never break a long pattern of any transaction
Compactness
Reduce irrelevant info—infrequent items are gone
Patterns containing p
…
Pattern f
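Header Table
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
Conditional pattern bases
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
[Figure: FP-tree rooted at {}, with node-links starting from the head column of the header table; surviving node labels are f:4, c:1, c:3, b:1, b:1, a:3, p:1, m:2, b:1.]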
pattern base
m-conditional FP-tree: {} → f:3 → c:3 → a:3
am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3)
cm-conditional FP-tree: {} → f:3
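How a conditional pattern base is read off the tree can be sketched in Python. This is a minimal illustration, assuming the five ordered transactions of the running example (fcamp, fcabm, fb, cbp, fcamp); the names `Node`, `build_fp_tree` and `conditional_pattern_base` are hypothetical.

```python
class Node:
    """One FP-tree node: an item, a count, and a link to its parent."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    """Insert each frequency-ordered transaction as a path from the root."""
    root = Node(None, None)
    header = {}  # item -> list of tree nodes carrying that item (node-links)
    for t in ordered_transactions:
        node = root
        for item in t:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(header, item):
    """Collect the prefix path of every node for `item`, weighted by that node's count."""
    base = {}
    for node in header.get(item, []):
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:  # a node directly under the root has no prefix path
            prefix = "".join(reversed(path))
            base[prefix] = base.get(prefix, 0) + node.count
    return base

DB = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(DB)
for item in "cabm":
    print(item, conditional_pattern_base(header, item))
```

Following the node-links for one item and walking up to the root is all that is needed; the transaction database is not rescanned.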
database partition
Method
For each frequent item, construct its conditional
FP-tree
Until the resulting FP-tree is empty, or it contains only
Tran. DB: fcamp, fcabm, fb, cbp, fcamp
Parallel projection needs a lot of disk space
Partition projection saves it
am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
…
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.), 0 to 70, versus support threshold (%), 0 to 3.]
[Figure: runtime (sec.), 0 to 100, versus support threshold (%), 0 to 2.]
Why Is FP-Growth the Winner?
Divide-and-conquer:
decompose both the mining task and DB according to
the frequent patterns obtained so far
leads to focused search of smaller databases
Other factors
no candidate generation, no candidate test
compressed database: FP-tree structure
no repeated scan of entire database
basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
CLOSET (DMKD’00)
Mining sequential patterns
TID  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F
A, B, C, D, E
2nd scan: find support for
AB, AC, AD, AE, ABCDE
BC, BD, BE, BCDE
Potential
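$$\mathrm{lift} = \frac{P(A \cup B)}{P(A)\,P(B)}$$
$$\mathrm{all\_conf} = \frac{\mathrm{sup}(X)}{\mathrm{max\_item\_sup}(X)}$$
Here P(A ∪ B) is the probability that a transaction contains both A and B.
            Milk    No Milk   Sum (row)
Coffee      m, c    ~m, c     c
No Coffee   m, ~c   ~m, ~c    ~c
Sum (col.)  m       ~m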
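Both measures can be computed directly from the four cells of the contingency table. The sketch below uses made-up counts purely for illustration (the slides give no numbers), and the function names `lift` and `all_conf` are hypothetical.

```python
def lift(n_ab, n_a, n_b, n_total):
    """lift = P(A and B together) / (P(A) * P(B)), from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

def all_conf(sup_x, item_sups):
    """all_conf = sup(X) / max support of any single item in X."""
    return sup_x / max(item_sups)

# Hypothetical 2x2 contingency counts (NOT from the slides).
m_c, not_m_c = 400, 350        # row "Coffee"
m_not_c, not_m_not_c = 200, 50  # row "No Coffee"

n_total = m_c + not_m_c + m_not_c + not_m_not_c
n_milk = m_c + m_not_c    # column sum m
n_coffee = m_c + not_m_c  # row sum c

milk_coffee_lift = lift(m_c, n_milk, n_coffee, n_total)
milk_coffee_all_conf = all_conf(m_c, [n_milk, n_coffee])
print(round(milk_coffee_lift, 3), round(milk_coffee_all_conf, 3))
```

A lift below 1 indicates that the two items co-occur less often than independence would predict; above 1, more often.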
them
Constrained mining vs. query processing in DBMS
Database query processing requires to find all
monotone
Itemset ab violates C
So does every superset of ab
Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Monotonicity for Constraint Pushing
Monotonicity
When an itemset S satisfies the constraint, so does any of its supersets
sum(S.Price) ≥ v is monotone
min(S.Price) ≤ v is monotone
Example. C: range(S.profit) ≥ 15
Itemset ab satisfies C
TDB (min_sup=2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g
Item  Profit
a     40
b     0
c     -20
d     10
e     -30
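The monotone example can be checked by brute force. The sketch below takes C as range(S.profit) ≥ 15 and uses the item profits a through h from the profit table in this deck; the helper names `profit_range`, `satisfies_c` and `supersets` are hypothetical.

```python
from itertools import combinations

# Item profits from the profit table (a..h).
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def profit_range(itemset):
    """range(S.profit) = max profit minus min profit over the items of S."""
    values = [PROFIT[i] for i in itemset]
    return max(values) - min(values)

def satisfies_c(itemset, v=15):
    """Constraint C: range(S.profit) >= v."""
    return profit_range(itemset) >= v

def supersets(itemset, universe):
    """Yield every proper superset of `itemset` drawn from `universe`."""
    rest = sorted(set(universe) - set(itemset))
    for r in range(1, len(rest) + 1):
        for extra in combinations(rest, r):
            yield set(itemset) | set(extra)

ab_range = profit_range("ab")
all_supersets_ok = all(satisfies_c(s) for s in supersets("ab", PROFIT))
print(ab_range, all_supersets_ok)  # prints: 40 True
```

Adding items can only widen the range, so once ab satisfies C there is no need to re-check C on any superset of ab.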
Succinctness:
Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based on
A1 , i.e., S contains a subset belonging to A1
Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
min(S.Price) ≤ v is succinct
sum(S.Price) ≥ v is not succinct
Optimization: If C is succinct, C is pre-counting pushable
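A small Python sketch of the selection idea, taking the succinct constraint as min(S.Price) ≤ v. The item names and prices are hypothetical, as are the function names; the point is only that membership in A1 decides the constraint without touching the transaction database.

```python
from itertools import combinations

# Hypothetical item prices (NOT from the slides).
PRICE = {"pen": 2, "ink": 8, "book": 25, "lamp": 40}

def select_a1(price, v):
    """A1: the items that individually satisfy price <= v."""
    return {item for item, p in price.items() if p <= v}

def satisfies_by_selection(itemset, a1):
    """S satisfies min(S.Price) <= v iff S contains at least one item of A1."""
    return bool(set(itemset) & a1)

def satisfies_directly(itemset, price, v):
    """Reference check: evaluate min(S.Price) <= v on the itemset itself."""
    return min(price[i] for i in itemset) <= v

v = 10
a1 = select_a1(PRICE, v)

# Compare the two checks on every non-empty itemset over the four items.
agree = all(
    satisfies_by_selection(s, a1) == satisfies_directly(s, PRICE, v)
    for r in range(1, len(PRICE) + 1)
    for s in combinations(PRICE, r)
)
print(sorted(a1), agree)
```

Because A1 is fixed before any counting starts, candidates that contain no item of A1 never have to be generated, which is what makes the constraint pre-counting pushable.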
Constraint                    Antimonotone  Monotone  Succinct
sum(S) ≤ v (∀a ∈ S, a ≥ 0)    yes           no        no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)    no            yes       no
range(S) ≤ v                  yes           no        no
range(S) ≥ v                  no            yes       no
support(S) ≤ ξ                no            yes       no
[Figure: classification of constraints into Antimonotone, Monotone, Succinct, Convertible anti-monotone, Convertible monotone, Strongly convertible, and Inconvertible.]