1
Chapter 7: Advanced Frequent Pattern Mining
◼ Summary
2
Research on Pattern Mining: A Road Map
3
Advanced Frequent Pattern Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Mining Multi-Level Association
◼ Mining Multi-Dimensional Association
◼ Mining Quantitative Association Rules
◼ Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
4
Mining Multiple-Level Association Rules
5
Multi-level Association: Flexible Support and
Redundancy filtering
◼ Flexible min-support thresholds: Some items are more valuable but
less frequent
◼ Use non-uniform, group-based min-support
◼ E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
◼ Redundancy filtering: some rules may be redundant due to "ancestor" relationships between items
◼ milk ⇒ wheat bread [support = 8%, confidence = 70%]
◼ 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule
◼ A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor (see the sketch below)
6
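To make the redundancy test concrete, here is a minimal sketch in Python. The 25% figure (2% milk as a share of all milk sold), the tolerance, and the helper name `is_redundant` are illustrative assumptions, not something prescribed by the slides.

```python
# Minimal sketch of ancestor-based redundancy filtering (values are illustrative).
def is_redundant(rule_sup, ancestor_sup, descendant_share, tol=0.1):
    """A rule is redundant if its support is close to the support 'expected'
    from its ancestor rule, scaled by the descendant item's share of the
    ancestor item (e.g., 2% milk is assumed to be ~25% of all milk sold)."""
    expected_sup = ancestor_sup * descendant_share
    return abs(rule_sup - expected_sup) <= tol * expected_sup

# "milk => wheat bread" has support 8%; assume 2% milk is 25% of milk sales.
# Expected support of "2% milk => wheat bread" is 8% * 0.25 = 2%, which matches
# the observed 2%, so the specialized rule carries no extra information.
print(is_redundant(rule_sup=0.02, ancestor_sup=0.08, descendant_share=0.25))  # True
```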
Advanced Frequent Pattern Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Mining Multi-Level Association
◼ Mining Multi-Dimensional Association
◼ Mining Quantitative Association Rules
◼ Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
7
Mining Multi-Dimensional Association
◼ Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")
◼ Multi-dimensional rules: ≥ 2 dimensions or predicates (see the encoding sketch below)
◼ Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
◼ Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
◼ Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
◼ Quantitative Attributes: Numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
8
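One simple way to feed inter-dimension associations to an ordinary frequent-itemset miner is to encode every (attribute, value) pair as a synthetic item. The sketch below illustrates the idea; the tuples and the brute-force support counter are hypothetical, chosen only to keep the example small.

```python
from itertools import combinations
from collections import Counter

# Hypothetical relational tuples; each dimension becomes a "predicate=value" item.
tuples = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "engineer", "buys": "milk"},
]

# Encode each tuple as a set of predicate items, e.g. age(X,"19-25") -> "age=19-25".
transactions = [{f"{attr}={val}" for attr, val in t.items()} for t in tuples]

# Count supports of all 1- and 2-predicate itemsets; since each tuple has every
# attribute exactly once, no predicate repeats (inter-dimension rules).
counts = Counter()
for trans in transactions:
    for k in (1, 2):
        for combo in combinations(sorted(trans), k):
            counts[combo] += 1

min_sup = 2
frequent = {itemset: c for itemset, c in counts.items() if c >= min_sup}
print(frequent)  # e.g. {('age=19-25',): 2, ('age=19-25', 'buys=coke'): 2, ...}
```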
Advanced Frequent Pattern Mining
◼ Pattern Mining: A Road Map
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Mining Multi-Level Association
◼ Mining Multi-Dimensional Association
◼ Mining Quantitative Association Rules
◼ Mining Rare Patterns and Negative Patterns
◼ Constraint-Based Frequent Pattern Mining
◼ Mining High-Dimensional Data and Colossal Patterns
◼ Mining Compressed or Approximate Patterns
◼ Pattern Exploration and Application
◼ Summary
9
Mining Quantitative Associations
10
Static Discretization of Quantitative Attributes
15
Defining Negative Correlated Patterns (II)
◼ Definition 2 (negative itemset-based)
◼ X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items and Ā is a set of negative items with |Ā| ≥ 1, and (2) s(X) ≥ μ (a sketch of computing s(X) follows below)
◼ Itemset X is negatively correlated if …
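A minimal sketch of computing the support s(X) of a negative itemset X = Ā ∪ B per part (2) of the definition above: a transaction supports X if it contains every positive item of B and none of the items in A. The tiny transaction database and item names are hypothetical.

```python
# Support of a negative itemset X = A_bar ∪ B.
def neg_support(transactions, positive_items, negated_items):
    hits = sum(
        1 for t in transactions
        if positive_items <= t and not (negated_items & t)
    )
    return hits / len(transactions)

# Hypothetical transactions.
db = [{"coffee", "sugar"}, {"coffee"}, {"tea", "sugar"}, {"coffee", "milk"}]
# X = {not tea} ∪ {coffee}: transactions that contain coffee but no tea.
print(neg_support(db, positive_items={"coffee"}, negated_items={"tea"}))  # 0.75
```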
◼ Summary
17
Constraint-based (Query-Directed) Mining
18
Constraints in Data Mining
◼ A constraint C is anti-monotone if, whenever a super-pattern satisfies C, all of its sub-patterns do so too
◼ In other words, anti-monotonicity: if an itemset S violates the constraint, so does any of its supersets
◼ Ex. 1. sum(S.price) ≤ v is anti-monotone (see the sketch below)
◼ Ex. 2. range(S.profit) ≤ 15 is anti-monotone
◼ Itemset ab violates C
◼ So does every superset of ab
◼ Ex. 3. sum(S.Price) ≥ v is not anti-monotone
◼ Ex. 4. support count is anti-monotone: the core property used in Apriori

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
22
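A minimal sketch of how an anti-monotone constraint prunes a level-wise (Apriori-style) search: once an itemset violates sum(S.price) ≤ v, none of its supersets is ever generated. The prices and threshold here are hypothetical and nonnegative (which is what makes this particular constraint anti-monotone); support counting against a transaction database is omitted to keep the sketch short.

```python
# Hypothetical, nonnegative prices; with prices >= 0, sum(S.price) <= V is anti-monotone.
price = {"a": 10, "b": 20, "c": 30, "d": 5}
V = 40

def satisfies(itemset):
    return sum(price[i] for i in itemset) <= V

# Level-wise enumeration: an itemset that violates the constraint is never
# extended, because every superset would violate it too.
level = [frozenset([i]) for i in price if satisfies([i])]
survivors = list(level)
while level:
    next_level = set()
    for s in level:
        for item in price:
            if item not in s and satisfies(s | {item}):
                next_level.add(s | {item})
    level = list(next_level)
    survivors.extend(level)

print(sorted(tuple(sorted(s)) for s in survivors))
# e.g. ('b', 'c') with total price 50 never appears, nor does any of its supersets.
```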
Pattern Space Pruning with Monotonicity Constraints
◼ A constraint C is monotone if, whenever a pattern satisfies C, all of its super-patterns satisfy C too
◼ Itemset {b, c}'s projected DB:
◼ T10': {d, f, h}, T20': {d, f, g, h}, T30': {d, f, g}
◼ Since C cannot be satisfied by T10', T10' can be pruned

TDB (min_sup = 2)
TID   Transaction

Item   Profit
c      -20
d      -15
e      -30
f      -10
g      20
h      -5
24
Pattern Space Pruning with Succinctness
◼ Succinctness:
◼ Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
◼ Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
◼ min(S.Price) ≤ v is succinct
◼ sum(S.Price) ≤ v is not succinct
◼ Optimization: if C is succinct, C is pre-counting pushable (see the sketch below)
25
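A minimal sketch of why a succinct constraint such as min(S.price) ≤ v is "pre-counting pushable": membership can be decided purely from which items are selected, so the candidate space can be fixed before any database scan. The prices, the threshold, and the variable names are hypothetical.

```python
from itertools import combinations

# Hypothetical item prices.
price = {"a": 5, "b": 12, "c": 30, "d": 8}
V = 10  # succinct constraint: min(S.price) <= 10

# Items that can make an itemset satisfy the constraint on their own.
cheap = {i for i, p in price.items() if p <= V}   # here {"a", "d"}

# Without scanning any transactions, keep exactly the itemsets that satisfy
# min(S.price) <= V: every satisfying set must contain at least one cheap item.
candidates = []
for k in range(1, len(price) + 1):
    for combo in combinations(sorted(price), k):
        if cheap & set(combo):
            candidates.append(combo)

print(candidates[:6])  # counting supports against the DB happens only afterwards
```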
Naïve Algorithm: Apriori + Constraint
Database D (TID: Items): 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (candidates): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3: {2 3 5}; Scan D → L3: {2 3 5}: 2
Constraint: Sum{S.price} < 5, applied only after mining in this naïve approach (see the sketch below)
26
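A minimal sketch of the naïve strategy on this slide: mine all frequent itemsets first, then filter by the constraint afterwards. The database and min_sup come from the slide; the item prices are an assumption (price(i) = i) made only so that Sum{S.price} < 5 can be evaluated.

```python
from itertools import combinations
from collections import Counter

# The slide's toy database (TID: items).
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_sup = 2

# Plain frequent-itemset mining by brute-force subset counting (fine at this scale).
counts = Counter()
for t in db:
    for k in range(1, len(t) + 1):
        for combo in combinations(sorted(t), k):
            counts[combo] += 1
frequent = {s for s, c in counts.items() if c >= min_sup}

# Naïve: the constraint is applied only after mining; assume price(i) = i.
constrained = {s for s in frequent if sum(s) < 5}
print(sorted(constrained))  # [(1,), (1, 3), (2,), (3,)]
```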
Constrained Apriori: Push a Succinct Constraint Deep

Database D (TID: Items): 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5

Scan D → C1: {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (candidates): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} (candidates without item 1 are "not immediately to be used")
Scan D → C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3: {2 3 5}; Scan D → L3: {2 3 5}: 2
Constraint: min{S.price} <= 1
27
Constrained FP-Growth: Push a Succinct Constraint Deep

1-Projected DB:
TID   Items
100   3 4
300   2 3 5

No need to project on 2, 3, or 5
Constraint: min{S.price} <= 1
28
Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep

Remove from data:
Original DB (TID: Items): 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5
After pruning (TID: Items): 100: 1 3; 300: 1 3 → FP-Tree
Constraint: min{S.price} <= 1
29
Constrained FP-Growth: Push a Data Anti-monotonic Constraint Deep

TDB (min_sup >= 2)
TID   Transaction
10    a, b, c, d, f, h
20    b, c, d, f, g, h
30    b, c, d, f, g
40    a, c, e, f, g
→ FP-Tree

Item   Profit
a      40
b      0
c      -20
d      -15
e      -30
f      -10
g      20
h      -5

B-Projected DB (with recursive data pruning):
TID   Transaction
10    a, c, d, f, h
20    c, d, f, g, h
30    c, d, f, g
→ B FP-Tree

Single branch: bcdfg: 2
Constraint: range{S.price} > 25, min_sup >= 2 (the transaction-pruning test is sketched below)
30
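A minimal sketch of the data-space pruning step illustrated above: for range(S.price) > 25, using the slide's profit table as the "price", a transaction is dropped from a pattern's projected DB as soon as the pattern can no longer reach the threshold even by absorbing every remaining item of that transaction. The extra short transaction at the end is hypothetical, added only so that a prune actually fires.

```python
profit = {"a": 40, "b": 0, "c": -20, "d": -15, "e": -30, "f": -10, "g": 20, "h": -5}
THRESHOLD = 25  # constraint from the slide: range(S.price) > 25

def can_ever_satisfy(pattern, transaction):
    """Best case for this transaction: take the pattern plus ALL remaining items;
    if even that range does not exceed the threshold, no extension of the pattern
    inside this transaction ever will (data anti-monotonicity)."""
    values = [profit[i] for i in set(pattern) | set(transaction)]
    return max(values) - min(values) > THRESHOLD

def prune_projected_db(pattern, projected_db):
    return [t for t in projected_db if can_ever_satisfy(pattern, t)]

# b's projected DB from the slide, plus one hypothetical short transaction.
b_projected = [
    {"a", "c", "d", "f", "h"},
    {"c", "d", "f", "g", "h"},
    {"c", "d", "f", "g"},
    {"d", "h"},  # hypothetical: best achievable range is 0 - (-15) = 15, so it is pruned
]
print(prune_projected_db({"b"}, b_projected))  # keeps only the slide's three transactions
```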
Convertible Constraints: Ordering Data in Transactions

◼ Convert tough constraints into anti-monotone or monotone by properly ordering items
◼ Examine C: avg(S.profit) ≥ 25
◼ Order items in value-descending order: <a, f, g, d, b, h, c, e>
◼ If an itemset afb violates C
◼ So does afbh, afb*
◼ It becomes anti-monotone! (see the sketch below)

TDB (min_sup = 2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g

Item   Profit
a      40
b      0
c      -20
d      10
e      -30
f      30
g      20
h      -10
31
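A minimal sketch of the conversion: with items listed in value-descending order R, the average of a prefix can only decrease as the prefix grows, so avg(S.profit) ≥ 25 behaves anti-monotonically w.r.t. R. The profit table is the one on the slide; growing a single prefix is a simplification of how a real miner explores prefix-respecting patterns in each projection.

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}

# R: items in value-descending order, as on the slide: <a, f, g, d, b, h, c, e>
R = sorted(profit, key=profit.get, reverse=True)

def avg(items):
    return sum(profit[i] for i in items) / len(items)

# Each item appended along R is no larger than anything already in the prefix,
# so the average never goes back up: once avg < 25, stop extending.
prefix = []
for item in R:
    if avg(prefix + [item]) >= 25:
        prefix.append(item)
    else:
        break

print(R)       # ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']
print(prefix)  # ['a', 'f', 'g', 'd']: appending b drops the average to 20, and it can never recover
```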
Strongly Convertible Constraints
33
Pattern Space Pruning w. Convertible Constraints

◼ C: avg(X) >= 25, min_sup = 2
◼ List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
◼ C is convertible anti-monotone w.r.t. R
◼ Scan TDB once
◼ Remove infrequent items
◼ Item h is dropped
◼ Itemsets a and f are good, …
◼ Projection-based mining
◼ Imposing an appropriate order on item projection
◼ Many tough constraints can be converted into (anti-)monotone

Item   Value
a      40
f      30
g      20
d      10
b      0
h      -10
c      -20
e      -30

TDB (min_sup = 2)
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e
34
Handling Multiple Constraints
35
What Constraints Are Convertible?
36
Constraint-Based Mining — A General Picture
Constraint                       Anti-Monotone   Monotone   Succinct
sum(S) ≤ v (∀ a ∈ S, a ≥ 0)      yes             no         no
sum(S) ≥ v (∀ a ∈ S, a ≥ 0)      no              yes        no
range(S) ≤ v                     yes             no         no
range(S) ≥ v                     no              yes        no
support(S) ≤ ξ                   no              yes        no
37
Advanced Frequent Pattern Mining
◼ Summary
38
Mining Colossal Frequent Patterns
◼ F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal
Frequent Patterns by Core Pattern Fusion”, ICDE'07.
◼ We have many algorithms, but can we mine large (i.e., colossal) patterns, say of size around 50 to 100? Unfortunately, not!
◼ Why not? The curse of the "downward closure" property of frequent patterns
◼ The "downward closure" property
◼ Any sub-pattern of a frequent pattern is frequent.
◼ Example: if (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
◼ Whether we use breadth-first search (e.g., Apriori) or depth-first search (e.g., FP-growth), we have to examine exponentially many patterns
◼ Thus the downward closure property leads to explosion!
39
Colossal Patterns: A Motivating Example
Let's make a set of 40 transactions:
T1  = 1 2 3 4 … 39 40
T2  = 1 2 3 4 … 39 40
  :
T40 = 1 2 3 4 … 39 40

Then delete the items on the diagonal:
T1  = 2 3 4 … 39 40
T2  = 1 3 4 … 39 40
  :
T40 = 1 2 3 4 … 39

◼ Closed/maximal patterns may partially alleviate the problem but not really solve it: we often need to mine scattered large patterns!
◼ Let the minimum support threshold be σ = 20
◼ There are (40 choose 20) frequent patterns of size 20
◼ Each is closed and maximal
◼ In general, # patterns = (n choose n/2), which grows roughly like 2^n/√n: the size of the answer set is exponential in n
40
Colossal Pattern Set: Small but Interesting
41
Mining Colossal Patterns: Motivation and
Philosophy
◼ Motivation: Many real-world tasks need mining colossal patterns
◼ Micro-array analysis in bioinformatics (when support is low)
[Figure: a transaction database D with a colossal pattern α; sub-patterns α1, α2, …, αk of α and their projected databases Dα, Dα1, Dα2, …, Dαk]
45
Robustness of Colossal Patterns
◼ Core Patterns
Intuitively, for a frequent pattern α, a sub-pattern β is a τ-core pattern of α if β shares a similar support set with α, i.e.,

    |Dα| / |Dβ| ≥ τ,   0 < τ ≤ 1

where Dα denotes the set of transactions containing α.
46
Example: Core Patterns
◼ A colossal pattern has far more core patterns than a small-sized pattern
◼ A colossal pattern has far more core descendants of a smaller size c
◼ A random draw from the complete set of patterns of size c is therefore more likely to pick a core descendant of a colossal pattern
◼ A colossal pattern can be generated by merging a set of core patterns (a small core-pattern computation is sketched below)

Transaction (# of Ts)   Core Patterns (τ = 0.5)
(abcef) (100)           (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
47
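A minimal sketch of the τ-core-pattern test defined above (|Dα| / |Dβ| ≥ τ). The tiny database is hypothetical, chosen so that exactly one sub-pattern fails the test.

```python
from itertools import combinations

def support_set(db, pattern):
    """IDs of the transactions containing every item of the pattern."""
    return {tid for tid, t in enumerate(db) if pattern <= t}

def core_patterns(db, alpha, tau=0.5):
    """All sub-patterns beta of alpha with |D_alpha| / |D_beta| >= tau."""
    d_alpha = support_set(db, alpha)
    cores = []
    for k in range(1, len(alpha) + 1):
        for beta in combinations(sorted(alpha), k):
            d_beta = support_set(db, set(beta))
            if d_beta and len(d_alpha) / len(d_beta) >= tau:
                cores.append(beta)
    return cores

# Hypothetical mini-DB: (a b c e f) occurs 4 times; e also occurs alone 5 more times,
# so ('e',) has support 9 and 4/9 < 0.5 makes it the only non-core sub-pattern here.
db = [{"a", "b", "c", "e", "f"}] * 4 + [{"e"}] * 5
print(core_patterns(db, {"a", "b", "c", "e", "f"}, tau=0.5))
```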
Colossal Patterns Correspond to Dense Balls
49
Idea of Pattern-Fusion Algorithm
50
Pattern-Fusion: The Algorithm
◼ A bounded-breadth pattern tree traversal
◼ It avoids the explosion of mid-sized patterns by fusing small core patterns and jumping directly toward colossal patterns
52
Pattern-Fusion Leads to Good Approximation
53
Experimental Setting
54
Experiment Results on Diagn
◼ LCM run time increases
exponentially with pattern
size n
◼ Pattern-Fusion finishes
efficiently
◼ The approximation error of Pattern-Fusion (with min-sup 20), measured against the complete answer set, is rather close to that of uniform sampling (which randomly picks K patterns from the complete answer set)
55
Experimental Results on ALL
◼ ALL: A popular gene expression data set with 38
transactions, each with 866 columns
◼ There are 1736 items in total
56
Experimental Results on REPLACE
◼ REPLACE
◼ A program trace data set, recording 4395 calls
and transitions
◼ The data set contains 4395 transactions with
57 items in total
◼ With support threshold of 0.03, the largest
57
Experimental Results on REPLACE
◼ Approximation error when
compared with the complete
mining result
◼ Example: out of the total 98 patterns of size ≥ 42, when K = 100, Pattern-Fusion returns 80 of them
◼ A good approximation to the
colossal patterns in the sense
that any pattern in the
complete set is on average at
most 0.17 items away from one
of these 80 patterns
58
Advanced Frequent Pattern Mining
◼ Summary
59
Mining Compressed Patterns: δ-clustering
◼ Why compressed patterns?
◼ Too many, but less meaningful
◼ Pattern distance measure (a sketch of the distance computation follows below)

ID   Item-Sets              Support
P1   {38,16,18,12}          205227
P2   {38,16,18,12,17}       205211
P3   {39,38,16,18,12,17}    101758
P4   {39,16,18,12,17}       161563
P5   {39,16,18,12}          161576
60
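A minimal sketch, assuming the transaction-set distance Dist(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)| used for δ-clustering in the cited compressed-pattern work. The supporting-transaction sets below are hypothetical stand-ins for patterns that, like P1 and P2 in the table, have nearly identical supports.

```python
def pattern_distance(t1, t2):
    """Jaccard-style distance between two patterns' supporting transaction sets."""
    return 1.0 - len(t1 & t2) / len(t1 | t2)

# Hypothetical supporting-transaction sets.
T_P1 = set(range(0, 1000))   # pattern P1 supported by transactions 0..999
T_P2 = set(range(0, 995))    # pattern P2 supported by a near-subset of them

delta = 0.05
d = pattern_distance(T_P1, T_P2)
print(d)            # 0.005
print(d <= delta)   # True: close enough to be represented by one pattern in a delta-cluster
```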
Redundancy-Aware Top-k Patterns
◼ Summary
62
How to Understand and Interpret Patterns?
◼ Semantic information (dictionary analogy): beyond non-semantic info., a pattern can be annotated with definitions indicating semantics, examples of usage, synonyms, and related words
◼ Semantic analysis with context models yields semantic annotations

Example annotation for pattern {x_yan, j_han}:
  Context units: < {p_yu, j_han}, {d_xin}, …, "graph pattern", "substructure similarity", … >
  Non (non-semantic info.): Sup = …
  CI (context indicators): {p_yu}, graph pattern, …
  Trans. (representative transactions): gSpan: graph-base……
  SSPs (semantically similar patterns): {j_wang}, {j_han, p_yu}, …
◼ Summary
67
Summary
◼ Roadmap: Many aspects & extensions on pattern mining
◼ Mining patterns in multi-level, multi-dimensional space
◼ Mining rare and negative patterns
◼ Constraint-based pattern mining
◼ Specialized methods for mining high-dimensional data
and colossal patterns
◼ Mining compressed or approximate patterns
◼ Pattern exploration and understanding: Semantic
annotation of frequent patterns
68
Ref: Mining Multi-Level and Quantitative Rules
◼ Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association
Rules, KDD'99
◼ T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using
two-dimensional optimized association rules: Scheme, algorithms, and
visualization. SIGMOD'96.
◼ J. Han and Y. Fu. Discovery of multiple-level association rules from large
databases. VLDB'95.
◼ R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
◼ R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
◼ R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96.
◼ K. Wang, Y. He, and J. Han. Mining frequent itemsets using support
constraints. VLDB'00
◼ K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing
optimized rectilinear regions for association rules. KDD'97.
69
Ref: Mining Other Kinds of Rules
◼ F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98
◼ Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of
Functional and Approximate Dependencies Using Partitions. ICDE’98.
◼ H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern
Extraction with Fascicles. VLDB'99
◼ B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
◼ R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining
association rules. VLDB'96.
◼ A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative
associations in a large database of customer transactions. ICDE'98.
◼ D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov.
Query flocks: A generalization of association-rule mining. SIGMOD'98.
70
Ref: Constraint-Based Pattern Mining
◼ R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97
◼ R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning
optimizations of constrained association rules. SIGMOD’98
◼ G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained
correlated sets. ICDE'00
◼ J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with
Convertible Constraints. ICDE'01
◼ J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in
Large Databases, CIKM'02
◼ F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated
Data Reduction in Constrained Pattern Mining, PKDD'03
◼ F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing
Framework for Graph Pattern Mining”, PAKDD'07
71
Ref: Mining Sequential Patterns
◼ X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with
gap constraints. ICDM'05
◼ H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. DAMI:97.
◼ J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
◼ R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. EDBT’96.
◼ X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large
Datasets. SDM'03.
◼ M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine
Learning:01.
72
Mining Graph and Structured Patterns
◼ A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. PKDD'00
◼ M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
◼ X. Yan and J. Han. gSpan: Graph-based substructure pattern mining.
ICDM'02
◼ X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns.
KDD'03
◼ X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent
structure analysis. ACM TODS, 30:960–993, 2005
◼ X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity
search. ACM Trans. Database Systems, 31:1418–1453, 2006
73
Ref: Mining Spatial, Spatiotemporal, Multimedia Data
74
Ref: Mining Frequent Patterns in Time-Series Data
75
Ref: FP for Classification and Clustering
◼ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
◼ B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule
Mining. KDD’98.
◼ W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based
on Multiple Class-Association Rules. ICDM'01.
◼ H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
◼ J. Yang and W. Wang. CLUSEQ: efficient and effective sequence
clustering. ICDE’03.
◼ X. Yin and J. Han. CPAR: Classification based on Predictive Association
Rules. SDM'03.
◼ H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern Analysis for Effective Classification. ICDE'07
76
Ref: Privacy-Preserving FP Mining
77
Mining Compressed Patterns
◼ D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy-
aware top-k patterns. KDD'06
◼ D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed
frequent-pattern sets. VLDB'05
◼ X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset
patterns: A profile-based approach. KDD'05
78
Mining Colossal Patterns
◼ F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal
frequent patterns by core pattern fusion. ICDE'07
◼ F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu. Mining Top-K Large Structural Patterns in a Massive Network. VLDB'11
79
Ref: FP Mining from Data Streams
◼ Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB'02.
◼ R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for
finding frequent elements in streams and bags. TODS 2003.
◼ G. Manku and R. Motwani. Approximate Frequency Counts over Data
Streams. VLDB’02.
◼ A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of
frequent and top-k elements in data streams. ICDT'05
80
Ref: Freq. Pattern Mining Applications
81