Data Mining Session 6
Main Theme: Mining Frequent Patterns, Association, and Correlations
Dr. Jean-Claude Franchitti
Agenda
1. Session Overview
2. Mining Frequent Patterns, Association, and Correlations
3. Summary and Conclusion
What is the class about?
Textbooks:
» Data Mining: Concepts and Techniques (2nd Edition)
Jiawei Han, Micheline Kamber
Morgan Kaufmann
ISBN-10: 1-55860-901-6, ISBN-13: 978-1-55860-901-3, (2006)
» Microsoft SQL Server 2008 Analysis Services Step by Step
Scott Cameron
Microsoft Press
ISBN-10: 0-73562-620-0, ISBN-13: 978-0-73562-620-3, 1st Edition (04/15/09)
Session Agenda
Icons / Metaphors
Information
Common Realization
Knowledge/Competency Pattern
Governance
Alignment
Solution Approach
Agenda
1. Session Overview
2. Mining Frequent Patterns, Association, and Correlations
3. Summary and Conclusion
Mining Frequent Patterns, Association and Correlations – Sub-Topics
Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click-stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Basic Concepts: Association Rules
Closed Patterns and Max-Patterns
Mining Frequent Patterns, Association and Correlations – Sub-Topics
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm (Pseudo-Code)
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
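As a concrete companion to the pseudo-code, here is a minimal, self-contained Python sketch of Apriori; the transaction database `db` is an illustrative stand-in, not data from the slides:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Candidate generation: extend each frequent k-itemset by one item,
        # then prune candidates with an infrequent k-subset (Apriori property)
        items = sorted({i for s in Lk for i in s})
        candidates = set()
        for s in Lk:
            for i in items:
                c = s | {i}
                if len(c) == k + 1 and all(frozenset(sub) in Lk
                                           for sub in combinations(c, k)):
                    candidates.add(frozenset(c))
        # Count candidates contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
freq = apriori(db, min_support=2)
```

With this database and min_support = 2, the result contains, e.g., {B, C, E} with count 2, while {A, B} (count 1) is pruned.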
Implementation of Apriori
How to Count Supports of Candidates?
Subset function
[Figure: hash tree storing candidate 3-itemsets (e.g. 1 2 4, 1 2 5, 1 4 5, 2 3 4, 3 4 5, 3 5 6, 5 6 7, ...) in its leaves. Items hash into three branches (1,4,7 / 2,5,8 / 3,6,9). For transaction 1 2 3 5 6, the prefixes 1+2356, 12+356, 13+56, ... are matched recursively, so only the leaves the transaction reaches are checked for containment.]
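The hash-tree idea in the figure can be approximated with a flat hash index: candidates are bucketed by hashing their items (here h(i) = i mod 3, mirroring the 1,4,7 / 2,5,8 / 3,6,9 branches), and each k-subset of a transaction probes only its own bucket. A minimal sketch with hypothetical candidate 3-itemsets:

```python
from itertools import combinations

def count_supports(candidates, transactions, k=3):
    """Count, for each candidate k-itemset, how many transactions contain it,
    enumerating each transaction's k-subsets and probing a hash index."""
    # Index candidates by the hashes of their sorted items: h(i) = i % 3
    index = {}
    for c in candidates:
        key = tuple(i % 3 for i in sorted(c))
        index.setdefault(key, []).append(frozenset(c))
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for sub in combinations(sorted(t), k):
            key = tuple(i % 3 for i in sub)
            # only candidates in the matching bucket are compared
            for c in index.get(key, []):
                if c == frozenset(sub):
                    counts[c] += 1
    return counts

cands = [(1, 2, 3), (1, 3, 5), (2, 3, 5), (3, 5, 6)]  # hypothetical candidates
counts = count_supports(cands, [{1, 2, 3, 5, 6}])
```

A real hash tree prunes whole subtrees during the recursive descent; the flat index above shows only the bucketing effect.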
Candidate Generation: An SQL Implementation
Further Improvement of the Apriori Method
DHP: Reduce the Number of Candidates
DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD begins; once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
[Figure: itemset lattice over {A, B, C, D} ({} at the bottom, ABCD at the top), shown against the transaction stream. Apriori starts counting 2-itemsets only after a full pass, while DIC starts counting 1-, 2-, and 3-itemsets as soon as their subsets are known frequent.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.
Pattern-Growth Approach:
Mining Frequent Patterns Without Candidate Generation
Find Patterns Having P From P-conditional Database
Recursion: Mining Each Conditional FP-tree
Cond. pattern base of "m": (fca:3) → m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
Benefits of the FP-tree Structure
Completeness
    Preserves complete information for frequent pattern mining
    Never breaks a long pattern of any transaction
Compactness
    Reduces irrelevant info: infrequent items are gone
    Items in frequency-descending order: the more frequently occurring, the more likely to be shared
    Never larger than the original database (not counting node-links and count fields)
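A minimal sketch of FP-tree construction illustrating these properties (node-links and header table omitted; the transaction data is the textbook-style example, assumed here for illustration):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support):
    """Build an FP-tree: keep frequent items only, insert each transaction
    with its items sorted in descending frequency so prefixes are shared."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None, None)
    for t in transactions:
        # frequency-descending order (ties broken alphabetically): compactness
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            node = node.children.setdefault(i, Node(i, node))
            node.count += 1
    return root, freq

db = [{'f', 'a', 'c', 'm', 'p'}, {'f', 'a', 'c', 'b', 'm'}, {'f', 'b'},
      {'c', 'b', 'p'}, {'f', 'a', 'c', 'm', 'p'}]
root, freq = build_fp_tree(db, min_support=3)
```

Three of the five transactions share the prefix path c → f → a, which is exactly the sharing the frequency-descending order is designed to maximize.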
Scaling FP-growth by Database Projection
Partition-Based Projection
[Figure: the transaction DB is partitioned into item-projected databases that are mined independently; e.g. am-proj DB = {(f c), (f c), (f c)} and cm-proj DB = {(f), (f), (f)}.]
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) vs. support threshold (%), 0 to 3%; FP-growth's run time stays low as the threshold drops while Apriori's grows sharply.]
[Figure: runtime (sec.) vs. support threshold (%), 0 to 2%.]
Advantages of the Pattern Growth Approach
Divide-and-conquer:
    Decompose both the mining task and the DB according to the frequent patterns obtained so far
    Leads to focused search of smaller databases
Other factors:
    No candidate generation, no candidate test
    Compressed database: FP-tree structure
    No repeated scan of the entire database
    Basic ops are counting local frequent items and building sub-FP-trees; no pattern search and matching
A good open-source implementation and refinement of FP-growth:
    FPgrowth* (G. Grahne and J. Zhu, FIMI'03)
CLOSET+: Mining Closed Itemsets by Pattern-Growth
Visualization of Association Rules: Plane Graph
Visualization of Association Rules (SGI/MineSet 3.0)
Multi-level Association: Redundancy Filtering
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
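The support and confidence of such a single-dimensional rule can be computed directly from transaction counts; a minimal sketch with illustrative data:

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the association rule lhs => rhs."""
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    # support = P(lhs and rhs); confidence = P(rhs | lhs)
    return n_both / n, (n_both / n_lhs if n_lhs else 0.0)

db = [{'milk', 'bread'}, {'milk', 'bread', 'butter'}, {'milk'},
      {'bread'}, {'milk', 'bread'}]
support, confidence = rule_stats(db, {'milk'}, {'bread'})
# milk appears in 4 of 5 transactions, milk-and-bread in 3:
# support = 3/5 = 0.6, confidence = 3/4 = 0.75
```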
Mining Quantitative Associations
Quantitative Association Rules
Mining Frequent Patterns, Association and Correlations – Sub-Topics
Are lift and χ2 Good Measures of Correlation?
Null-Invariant Measures
Comparison of Interestingness Measures
[Table: interestingness measures compared by their behavior w.r.t. null-transactions and the measures m and c; the Kulczynski measure (1927) is null-invariant.]
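The null-invariance contrast can be checked numerically: adding null-transactions (containing neither A nor B) inflates lift, while the Kulczynski measure is unchanged. A sketch with illustrative counts:

```python
def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); depends on the total n."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def kulczynski(n_ab, n_a, n_b):
    """Kulc(A, B) = (P(B|A) + P(A|B)) / 2; independent of the total n."""
    return 0.5 * (n_ab / n_a + n_ab / n_b)

# illustrative counts: 100 transactions contain both A and B,
# 1000 contain A, 1000 contain B
base = lift(100, 1000, 1000, 10000)        # 1.0
inflated = lift(100, 1000, 1000, 100000)   # 10.0 after adding 90,000 null-transactions
k = kulczynski(100, 1000, 1000)            # 0.1 either way
```

Lift jumps by a factor of ten purely because of transactions that mention neither item, which is exactly the distortion null-invariant measures avoid.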
Constraint-based (Query-Directed) Mining
Anti-Monotonicity in Constraint Pushing

A constraint C is anti-monotone if whenever a super-pattern satisfies C, all of its sub-patterns do so too. In other words: if an itemset S violates the constraint, so does any of its supersets.
Ex. 1. sum(S.price) ≤ v is anti-monotone
Ex. 2. range(S.profit) ≤ 15 is anti-monotone
    Itemset ab violates C, and so does every superset of ab
Ex. 3. sum(S.price) ≥ v is not anti-monotone
Ex. 4. support count is anti-monotone: the core property used in Apriori

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
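Anti-monotone pruning can be made concrete: in a depth-first enumeration of the itemset lattice, a branch is cut the moment range(S.profit) ≤ 15 fails, since no superset can recover. A minimal sketch using the profit table above (the enumeration code itself is illustrative, not from the slides):

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30,
          'f': 30, 'g': 20, 'h': -10}

def violates_range(S, v=15):
    """range(S.profit) <= v is anti-monotone: once violated, every superset violates."""
    vals = [profit[i] for i in S]
    return max(vals) - min(vals) > v

def enumerate_satisfying(items, v=15):
    """DFS over the itemset lattice, pruning a branch as soon as the constraint fails."""
    out = []
    def dfs(prefix, rest):
        for idx, i in enumerate(rest):
            S = prefix + [i]
            if violates_range(S, v):
                continue  # anti-monotone: no superset of S can satisfy C
            out.append(frozenset(S))
            dfs(S, rest[idx + 1:])
    dfs([], items)
    return out

sat = enumerate_satisfying(sorted(profit))
# {a, b} has range 40 > 15, so it and all its supersets are never emitted
```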
Monotonicity for Constraint Pushing

A constraint C is monotone if whenever a pattern satisfies C, we do not need to check C in subsequent mining. Alternatively: if an itemset S satisfies the constraint, so does any of its supersets.
Ex. 1. sum(S.price) ≥ v is monotone
Ex. 2. min(S.price) ≤ v is monotone
Ex. 3. C: range(S.profit) ≥ 15
    Itemset ab satisfies C, and so does every superset of ab

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Data Antimonotonicity: Pruning Data Space

A constraint c is data anti-monotone if whenever a pattern p cannot satisfy a transaction t under c, p's supersets cannot satisfy t under c either. The key to data anti-monotonicity is recursive data reduction.
Ex. 1. sum(S.price) ≥ v is data anti-monotone
Ex. 2. min(S.price) ≤ v is data anti-monotone
Ex. 3. C: range(S.profit) ≥ 25 is data anti-monotone
    Itemset {b, c}'s projected DB: T10': {d, f, h}, T20': {d, f, g, h}, T30': {d, f, g}
    Since C cannot be satisfied in T10', T10' can be pruned

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f, h
20   b, c, d, f, g, h
30   b, c, d, f, g
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     -15
e     -30
f     -10
g     20
h     -5
Succinctness
Succinctness:
    Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
    Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
    min(S.price) ≤ v is succinct
    sum(S.price) ≥ v is not succinct
Optimization: if C is succinct, C is pre-counting pushable
The Constrained FP-Growth Algorithm: Push a Succinct Constraint Deep
Constraint: min{S.price} ≤ 1

1-Projected DB:
TID  Items
100  3 4
300  2 3 5

No need to project on 2, 3, or 5.
The Constrained FP-Growth Algorithm: Push a Data Antimonotonic Constraint Deep

TDB:
TID  Transaction
10   a, b, c, d, f, h
20   b, c, d, f, g, h
30   b, c, d, f, g
40   a, c, e, f, g
→ FP-Tree

B-Projected DB (with recursive data pruning):
TID  Transaction
10   a, c, d, f, h
20   c, d, f, g, h
30   c, d, f, g
→ B FP-Tree

Item  Profit
a     40
b     0
c     -20
d     -15
e     -30
f     -10
g     20
h     -5
A Classification of Constraints
[Figure: Venn diagram of constraint classes. Anti-monotone, monotone, and succinct constraints overlap; convertible anti-monotone and convertible monotone constraints intersect in the strongly convertible ones; the remaining constraints are inconvertible.]
Why Mining Colossal Frequent Patterns?
Then delete the items on the diagonal:
T1  = 2 3 4 ... 39 40
T2  = 1 3 4 ... 39 40
:
T40 = 1 2 3 4 ... 39
Each of these patterns is closed and maximal.
# patterns = C(n, n/2) ≈ 2^n · √(2/(πn))
The size of the answer set is exponential in n.
Colossal Pattern Set: Small but Interesting
Alas, A Show of Colossal Pattern Mining!
Transaction Database D
[Figure: a colossal pattern α with its projected database Dα, alongside smaller core patterns α1, α2, ..., αk and their projected databases Dα1, Dα2, ..., Dαk within D.]
Core Patterns
Intuitively, for a frequent pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e.,

|Dα| / |Dβ| ≥ τ,  0 < τ ≤ 1

where τ is called the core ratio.
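The core-ratio test follows directly from the definition; a minimal sketch over an illustrative transaction database:

```python
def support_set(pattern, transactions):
    """D_pattern: indices of the transactions containing the pattern."""
    return {i for i, t in enumerate(transactions) if pattern <= t}

def is_core_pattern(beta, alpha, transactions, tau=0.5):
    """beta is a tau-core pattern of alpha if |D_alpha| / |D_beta| >= tau."""
    d_alpha = support_set(alpha, transactions)
    d_beta = support_set(beta, transactions)
    return len(d_alpha) / len(d_beta) >= tau

# illustrative DB: 4 transactions support the colossal pattern, plus extras
db = [{'a', 'b', 'c', 'e', 'f'}] * 4 + [{'a', 'b'}] * 2 + [{'c'}] * 10
alpha = {'a', 'b', 'c', 'e', 'f'}
core_ab = is_core_pattern({'a', 'b'}, alpha, db, tau=0.5)  # 4/6 >= 0.5
core_c = is_core_pattern({'c'}, alpha, db, tau=0.5)        # 4/14 < 0.5
```

{a, b} shares most of α's support set and qualifies as a core pattern, while {c} occurs in too many unrelated transactions and does not.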
Example: Core Patterns
A colossal pattern has far more core patterns than a small-sized pattern, and far more core descendants of a smaller size c. A random draw from the complete set of patterns of size c is therefore more likely to pick a core descendant of a colossal pattern. A colossal pattern can be generated by merging a set of core patterns.
Example: the pattern (abcef) with support 100 has core patterns (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef).
Dist(α, β) ≤ 1 − 1/(2/τ − 1) = r(τ)

Once we identify one core pattern, we can find all the other core patterns within a bounding ball of radius r(τ).
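The bounding radius is easy to evaluate, e.g. r(0.5) = 1 − 1/(2/0.5 − 1) = 2/3; a sketch:

```python
def r(tau):
    """Bounding-ball radius r(tau) = 1 - 1/(2/tau - 1), for core ratio 0 < tau <= 1."""
    return 1.0 - 1.0 / (2.0 / tau - 1.0)

radius = r(0.5)  # 1 - 1/3 = 2/3
# At tau = 1 the ball shrinks to radius 0: only patterns with identical
# support sets are core patterns of each other.
```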
Colossal Patterns Correspond to Dense Balls
A bounded-breadth pattern tree traversal
    Avoids explosion in mining mid-sized patterns
    Randomness helps to stay on the right path
Ability to identify "short-cuts" and take "leaps"
    Fuses small patterns together in one step to generate new patterns of significant size
Efficiency
Pattern-Fusion Leads to Good Approximation
Experimental Setting
Experiment Results on Diagn
Experimental Results on REPLACE
REPLACE
    A program trace data set, recording 4395 calls and transitions
    The data set contains 4395 transactions with 57 items in total
    With a support threshold of 0.03, the largest patterns are of size 44
    They are all discovered by Pattern-Fusion with different settings of K and τ, when started with an initial pool of 20,948 patterns of size ≤ 3
[Figure: support values (0.001 to 0.008) of the 80 patterns discovered, a good approximation to the colossal patterns in the sense that any pattern in the complete set is close to one of the 80 patterns.]
Agenda
1. Session Overview
2. Mining Frequent Patterns, Association, and Correlations
3. Summary and Conclusion
Ref: Apriori and Its Improvements
Ref: Vertical Format and Row Enumeration Methods
Ref: Mining Correlations and Interesting Rules
Ref: Constraint-Based Pattern Mining
Ref: Mining Spatial, Multimedia, and Web Data
Ref: FP for Classification and Clustering
Ref: Other Freq. Pattern Mining Applications
Readings
» Chapter 5
Individual Project #1
» Ongoing