FPA - Advance
Concepts and
Techniques
Research on Pattern Mining: A Road Map
Advanced Frequent Pattern Mining
Pattern Mining: A Road Map
Pattern Mining in Multi-Level, Multi-Dimensional
Space
Mining Multi-Level Association
Mining Multi-Dimensional Association
Mining Quantitative Association Rules
Mining Rare Patterns and Negative Patterns
Constraint-Based Frequent Pattern Mining
Mining High-Dimensional Data and Colossal Patterns
Mining Compressed or Approximate Patterns
Pattern Exploration and Application
Summary
Mining Multiple-Level Association
Rules
Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have
lower support
Exploration of shared multi-level mining (Srikant
& Agrawal@VLDB’95, Han & Fu@VLDB’95)
Multi-level Association: Flexible Support
and Redundancy filtering
Flexible min-support thresholds: Some items are more valuable
but less frequent
Use non-uniform, group-based min-support
E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%;
…
Redundancy Filtering: Some rules may be redundant due to
“ancestor” relationships between items
milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule
A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor
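The redundancy test can be sketched in a few lines. This is only an illustration: the function names, the 10% tolerance, and the assumption that "2% milk" makes up one quarter of all milk sold are hypothetical and do not come from the slide.

```python
def expected_support(ancestor_support: float, share_of_parent: float) -> float:
    """Support the descendant rule would have if it simply inherited
    the ancestor rule's behaviour, scaled by the item's share."""
    return ancestor_support * share_of_parent


def is_redundant(actual_support: float, expected: float, tolerance: float = 0.1) -> bool:
    """Treat a rule as redundant when its support lies within a relative
    tolerance of the expected value (the tolerance is an assumption)."""
    return abs(actual_support - expected) <= tolerance * expected


# Ancestor rule from the slide: milk => wheat bread [support = 8%].
# Assumption (not in the slide): 2% milk is one quarter of all milk sold.
exp = expected_support(0.08, 0.25)
print(is_redundant(0.02, exp))  # the descendant rule has support = 2%
```

Under that assumed share, the expected support equals the observed 2%, so the second rule adds no information beyond its ancestor.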
Mining Multi-Dimensional
Association
Single-dimensional rules:
buys(X, “milk”) => buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X, “19-25”) ^ occupation(X, “student”) => buys(X, “coke”)
Hybrid-dimension assoc. rules (repeated predicates)
age(X, “19-25”) ^ buys(X, “popcorn”) => buys(X, “coke”)
Categorical Attributes: finite number of possible values,
no ordering among values—data cube approach
Quantitative Attributes: Numeric, implicit ordering
among values—discretization, clustering, and gradient
approaches
Mining Quantitative Associations
Summary
Constraint-based (Query-Directed)
Mining
category
Rule (or pattern) constraint
small sales (price < $10) triggers big sales (sum >
$200)
Interestingness constraint
strong rules: min_support ≥ 3%, min_confidence
≥ 60%
Meta-Rule Guided Mining
Meta-rule can be in the rule form with partially instantiated
predicates and constants
P1(X, Y) ^ P2(X, W) => buys(X, “iPad”)
The resulting rule derived can be
age(X, “15-25”) ^ profession(X, “student”) => buys(X, “iPad”)
In general, it can be in the form of
P1 ^ P2 ^ … ^ Pl => Q1 ^ Q2 ^ … ^ Qr
Method to find meta-rules
Find frequent (l+r) predicates (based on min-support threshold)
Push constants deeply when possible into the mining process
(see the remaining discussions on constraint-push techniques)
Use confidence, correlation, and other measures when possible
Constraint-Based Frequent Pattern
Mining
Pattern space pruning constraints
Anti-monotonic: If constraint c is violated, its further
mining can be terminated
Monotonic: If c is satisfied, no need to check c again
Succinct: c must be satisfied, so one can start with the
data sets satisfying c
Convertible: c is neither monotonic nor anti-monotonic, but it
can be converted into one of them if items in the transaction can be
properly ordered
Data space pruning constraint
Data succinct: Data space can be pruned at the initial
pattern mining process
Data anti-monotonic: If a transaction t does not satisfy c, t
can be pruned from its further mining
Pattern Space Pruning with Anti-Monotonicity
Constraints
A constraint C is anti-monotone if, whenever a super-pattern
satisfies C, all of its sub-patterns do so too
In other words, anti-monotonicity: If an
itemset S violates the constraint, so does
any of its supersets
Ex. 1. sum(S.price) ≤ v is anti-monotone
TDB (min_sup=2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g
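The pruning property can be checked by brute force on a toy example. The sketch below assumes non-negative prices and uses a hypothetical price table and threshold (neither appears on the slide); it verifies that once an itemset violates sum(S.price) ≤ v, every superset violates it as well.

```python
from itertools import combinations

# Hypothetical non-negative prices and threshold, for illustration only.
PRICE = {"a": 10, "b": 20, "c": 5, "d": 30}
V = 25


def violates_sum_le(itemset, price=PRICE, v=V):
    """True if the itemset breaks the constraint sum(S.price) <= v."""
    return sum(price[i] for i in itemset) > v


def all_supersets_violate(itemset, universe=PRICE):
    """Brute-force check of anti-monotonicity for one violating itemset."""
    rest = [i for i in universe if i not in itemset]
    for k in range(1, len(rest) + 1):
        for extra in combinations(rest, k):
            if not violates_sum_le(set(itemset) | set(extra)):
                return False
    return True


print(violates_sum_le({"a", "b"}), all_supersets_violate({"a", "b"}))
```

Because of this property, a miner can stop extending {a, b} as soon as the violation is detected.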
Succinctness:
Given A1, the set of items satisfying a succinctness
constraint C, then any set S satisfying C is based
on A1, i.e., S contains a subset belonging to A1
Idea: Without looking at the transaction database,
whether an itemset S satisfies constraint C can be
determined based on the selection of items
min(S.Price) ≤ v is succinct
sum(S.Price) ≥ v is not succinct
Optimization: If C is succinct, C is pre-counting
pushable
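A minimal sketch of the succinctness idea for min(S.price) ≤ v follows. The price table (price equal to the item number) and the helper names are assumptions made for illustration. The check confirms that testing "does S contain an item of A1?" gives the same answer as evaluating the constraint directly, without touching any transaction.

```python
from itertools import combinations

# Assumption for illustration: the price of item i is i.
PRICE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
V = 1


def items_satisfying(price=PRICE, v=V):
    """A1: the single items whose price is <= v."""
    return {i for i, p in price.items() if p <= v}


def satisfies_directly(itemset, price=PRICE, v=V):
    """Evaluate min(S.price) <= v on the itemset itself."""
    return min(price[i] for i in itemset) <= v


def satisfies_via_a1(itemset, a1):
    """Succinct test: S satisfies the constraint iff it contains an item of A1."""
    return bool(set(itemset) & a1)


A1 = items_satisfying()
all_itemsets = [c for k in range(1, len(PRICE) + 1)
                for c in combinations(PRICE, k)]
agree = all(satisfies_directly(s) == satisfies_via_a1(s, A1) for s in all_itemsets)
print(A1, agree)
```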
Naïve Algorithm: Apriori + Constraint
Constraint: Sum{S.price} < 5
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Scan D -> C1 (itemset: sup)   {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (itemset: sup)             {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (itemset)                  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> C2 (itemset: sup)   {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 (itemset: sup)             {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3 (itemset)                  {2 3 5}
Scan D -> L3 (itemset: sup)   {2 3 5}: 2
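The naïve approach can be sketched as plain Apriori followed by a post-filter on the constraint. This is an illustrative implementation, not the one used to produce the slide; in particular, treating the price of item i as i is an assumption, since the slide does not give a price table.

```python
from itertools import combinations

# Database D from the slide, min_sup = 2.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]


def apriori(transactions, min_sup):
    """Return {frozenset: support} for all frequent itemsets (level-wise)."""
    def support(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = sorted({i for t in transactions for i in t})
    level = {}
    for i in items:
        c = frozenset([i])
        s = support(c)
        if s >= min_sup:
            level[c] = s
    frequent = dict(level)
    k = 2
    while level:
        prev = list(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_sup:
                level[c] = s
        frequent.update(level)
        k += 1
    return frequent


frequent = apriori(D, min_sup=2)
# Naive constraint handling: filter only after mining has finished.
# Assumption: price(i) = i.
constrained = {s for s in frequent if sum(s) < 5}
print(sorted(sorted(s) for s in constrained))
```

All nine frequent itemsets are generated first and most are then thrown away, which is the inefficiency that constraint pushing tries to avoid.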
Constrained Apriori: Push a Succinct
Constraint Deep
Constraint: min{S.price} <= 1
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Scan D -> C1 (itemset: sup)   {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (itemset: sup)             {1}: 2, {2}: 3, {3}: 3, {5}: 3
C2 (itemset)                  {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> C2 (itemset: sup)   {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 (itemset: sup)             {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C3 (itemset)                  {2 3 5}
Scan D -> L3 (itemset: sup)   {2 3 5}: 2
Slide annotation on the candidate itemsets: not immediately to be used
Constrained FP-Growth: Push a Succinct
Constraint Deep
Constraint: min{S.price} <= 1
1-Projected DB
TID   Items
100   3 4
300   2 3 5
No need to project on 2, 3, or 5
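The 1-projected database can be derived mechanically from D. The sketch below is illustrative: the helper name is hypothetical, and the remark that only item 1 satisfies the constraint assumes the price of item i is i, which the slide does not state.

```python
# Database D from the earlier slides, keyed by TID.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}


def projected_db(transactions, item):
    """Keep the transactions that contain `item` and drop `item` from each."""
    return {tid: sorted(i for i in items if i != item)
            for tid, items in transactions.items() if item in items}


# With price(i) = i (an assumption), only item 1 satisfies
# min{S.price} <= 1, so only the 1-projected database needs to be built.
one_projected = projected_db(D, 1)
print(one_projected)
```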
Constrained FP-Growth: Push a Data Anti-monotonic
Constraint Deep
Constraint: min{S.price} <= 1
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
After "remove from data" (input to the FP-Tree)
TID   Items
100   1 3
300   1 3
Constrained FP-Growth: Push a Data Anti-monotonic
Constraint Deep
Constraint: range{S.price} > 25, min_sup >= 2
TDB
TID   Transaction
10    a, b, c, d, f, h
20    b, c, d, f, g, h
30    b, c, d, f, g
40    a, c, e, f, g
Item  Profit
a     40
b     0
c     -20
d     -15
e     -30
f     -10
g     20
h     -5
TDB -> FP-Tree -> B-Projected DB
TID   Transaction
10    a, c, d, f, h
20    c, d, f, g, h
30    c, d, f, g
Recursive data pruning -> B FP-Tree
Single branch: bcdfg: 2
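The recursive data pruning step can be sketched as a fixed-point loop over the B-projected database. The function below is an assumed illustration, not the slide's algorithm verbatim: it alternates between dropping items that are infrequent in the projected database and dropping transactions whose items, together with the prefix b, can no longer reach a range above 25. The constraint is evaluated on the Profit column, since that is the only value table the slide provides.

```python
from collections import Counter

# Profit table and B-projected database copied from the slide.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": -15, "e": -30, "f": -10, "g": 20, "h": -5}
B_PROJECTED = {10: {"a", "c", "d", "f", "h"},
               20: {"c", "d", "f", "g", "h"},
               30: {"c", "d", "f", "g"}}


def recursive_prune(db, prefix, profit, min_range, min_sup):
    """Repeat item pruning and transaction pruning until nothing changes."""
    db = {tid: set(items) for tid, items in db.items()}
    changed = True
    while changed:
        changed = False
        # 1. Remove items that are infrequent in the current projected DB.
        counts = Counter(i for items in db.values() for i in items)
        for tid in list(db):
            kept = {i for i in db[tid] if counts[i] >= min_sup}
            if kept != db[tid]:
                db[tid] = kept
                changed = True
        # 2. Remove transactions that cannot satisfy range > min_range,
        #    even when all of their remaining items are used with the prefix.
        for tid in list(db):
            values = [profit[i] for i in db[tid] | set(prefix)]
            if max(values) - min(values) <= min_range:
                del db[tid]
                changed = True
    return db


pruned = recursive_prune(B_PROJECTED, {"b"}, PROFIT, min_range=25, min_sup=2)
print(pruned)
```

On the slide's data, item a is infrequent and is removed first, which makes transaction 10 prunable; h then becomes infrequent, leaving c, d, f, g in transactions 20 and 30. Together with the prefix b this matches the single branch bcdfg: 2.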
Convertible Constraints: Ordering Data in
Transactions
Convert tough constraints into anti-monotone or monotone by properly
ordering items
Examine C: avg(S.profit) >= 25
Order items in value-descending order
<a, f, g, d, b, h, c, e>
If an itemset afb violates C
So does afbh, afb*
It becomes anti-monotone!
TDB (min_sup=2)
TID   Transaction
10    a, b, c, d, f
20    b, c, d, f, g, h
30    a, c, d, e, f
40    c, e, f, g
Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
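The effect of the ordering can be checked directly on the slide's profit table. In the sketch below (helper names are illustrative), the constraint is not anti-monotone on plain itemsets, yet once items are listed in value-descending order, every extension of the violating prefix afb with later items also violates it.

```python
from itertools import combinations

# Profit table from the slide.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30, "f": 30, "g": 20, "h": -10}
# Value-descending order R.
R = sorted(PROFIT, key=PROFIT.get, reverse=True)


def avg_profit(itemset):
    return sum(PROFIT[i] for i in itemset) / len(itemset)


def satisfies(itemset, threshold=25):
    """Constraint C: avg(S.profit) >= threshold."""
    return avg_profit(itemset) >= threshold


def extensions(prefix):
    """Itemsets obtained by appending items that follow the prefix's last item in R."""
    tail = R[R.index(prefix[-1]) + 1:]
    for k in range(1, len(tail) + 1):
        for extra in combinations(tail, k):
            yield list(prefix) + list(extra)


prefix = ["a", "f", "b"]
print(R)
print(satisfies(prefix), any(satisfies(e) for e in extensions(prefix)))
# Without the ordering the constraint is not anti-monotone:
# {g} violates C, but its superset {a, g} satisfies it.
print(satisfies(["g"]), satisfies(["a", "g"]))
```

Appended items are never larger than the last item of the prefix, so they cannot raise the average; this is why the constraint behaves anti-monotonically along R.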
Strongly Convertible Constraints
Pattern Space Pruning w. Convertible
Constraints
C: avg(X) >= 25, min_sup=2
List items in every transaction in value
descending order R: <a, f, g, d, b, h, c, e>
C is convertible anti-monotone w.r.t. R
Scan TDB once
remove infrequent items
Item h is dropped
Itemsets a and f are good, …
Projection-based mining
Imposing an appropriate order on item
projection
Many tough constraints can be
converted into (anti)-monotone
Item  Value
a     40
f     30
g     20
d     10
b     0
h     -10
c     -20
e     -30
TDB (min_sup=2)
TID   Transaction
10    a, f, d, b, c
20    f, g, d, b, c
30    a, f, d, c, e
40    f, g, h, c, e
Handling Multiple Constraints
Constraint-Based Mining — A General
Picture
Constraint                      Anti-monotone   Monotone   Succinct
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes             no         no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no              yes        no
range(S) ≤ v                    yes             no         no
range(S) ≥ v                    no              yes        no
support(S) ≤ ξ                  no              yes        no
Mining Colossal Frequent Patterns
F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal
Frequent Patterns by Core Pattern Fusion”, ICDE'07.
We have many algorithms, but can we mine large (i.e., colossal)
patterns, say of size around 50 to 100? Unfortunately, not!
Why not? The curse of the “downward closure” property of frequent patterns
The “downward closure” property
Any sub-pattern of a frequent pattern is frequent.
Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1,
a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There
are about 2^100 such frequent itemsets!
Whether we use breadth-first search (e.g., Apriori) or depth-first
search (FPgrowth), we have to examine that many patterns
Thus the downward closure property leads to explosion!
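The size of this explosion is simple arithmetic; the short sketch below counts the non-empty sub-patterns of a single frequent pattern of size 100 by summing binomial coefficients. The variable names are illustrative.

```python
from math import comb

n = 100
# Every non-empty subset of a frequent n-itemset is itself frequent.
nonempty_subpatterns = sum(comb(n, k) for k in range(1, n + 1))
print(nonempty_subpatterns == 2**n - 1)
print(f"{nonempty_subpatterns:.3e}")
```

The count is 2^100 - 1, on the order of 10^30, which no level-wise or depth-first enumeration can visit in practice.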
Colossal Patterns: A Motivating
Example
Let’s make a set of 40 transactions
T1 = 1 2 3 4 ….. 39 40
T2 = 1 2 3 4 ….. 39 40
: .
T40 = 1 2 3 4 ….. 39 40
Then delete the items on the diagonal
T1 = 2 3 4 ….. 39 40
T2 = 1 3 4 ….. 39 40
: .
T40 = 1 2 3 4 …… 39
Closed/maximal patterns may partially alleviate the problem but not
really solve it: We often need to mine scattered large patterns!
Let the minimum support threshold σ = 20
There are $\binom{40}{20}$ frequent patterns of size 20
Each is closed and maximal
# patterns = $\binom{n}{n/2} \approx \sqrt{2/\pi}\, 2^n / \sqrt{n}$
The size of the answer set is exponential to n
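The count of maximal patterns can be verified by brute force on a small instance of the same construction. The sketch below is an illustration with hypothetical helper names; it uses n = 6 and min_sup = 3 instead of n = 40 and σ = 20 so that exhaustive enumeration stays cheap.

```python
from itertools import combinations
from math import comb


def diag(n):
    """n transactions over items 1..n; transaction i omits item i."""
    full = frozenset(range(1, n + 1))
    return [full - {i} for i in range(1, n + 1)]


def maximal_frequent(transactions, min_sup):
    """Enumerate all frequent itemsets, then keep those with no frequent superset."""
    items = sorted(set().union(*transactions))
    frequent = {frozenset(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if sum(1 for t in transactions if set(c) <= t) >= min_sup}
    return [p for p in frequent if not any(p < q for q in frequent)]


n = 6
maximal = maximal_frequent(diag(n), min_sup=n // 2)
print(len(maximal), comb(n, n // 2))
```

A pattern with k items is missing from exactly k transactions, so its support is n - k; the maximal frequent patterns are therefore all itemsets of size n/2, and their number grows as the central binomial coefficient.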
Colossal Pattern Set: Small but Interesting
[Figure: transaction database D; a colossal pattern α and patterns α1, α2, …, αk, with their transaction sets Dα, Dα1, Dα2, …, Dαk]
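Robustness of Colossal Patterns
Core Patterns
Intuitively, for a frequent pattern α, a subpattern β is a τ-core
pattern of α if β shares a similar support set with α, i.e.,
$\frac{|D_\alpha|}{|D_\beta|} \ge \tau, \quad 0 < \tau \le 1$
where $D_\alpha$ is the set of transactions in D that contain α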
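The core-pattern test reduces to comparing two support sets. The sketch below is a minimal illustration: the function names and the six-transaction database are hypothetical and are not taken from the slides.

```python
# Hypothetical toy database, keyed by transaction id.
DB = {1: {"a", "b", "c"}, 2: {"a", "b", "c"}, 3: {"a", "b"},
      4: {"a"}, 5: {"a"}, 6: {"c"}}


def support_set(pattern, db):
    """D_pattern: ids of the transactions that contain every item of the pattern."""
    return {tid for tid, items in db.items() if set(pattern) <= items}


def core_ratio(beta, alpha, db):
    """|D_alpha| / |D_beta| for a subpattern beta of alpha."""
    return len(support_set(alpha, db)) / len(support_set(beta, db))


def is_core(beta, alpha, db, tau):
    """beta is a tau-core pattern of alpha if it is a subpattern and the ratio is >= tau."""
    return set(beta) <= set(alpha) and core_ratio(beta, alpha, db) >= tau


alpha = {"a", "b", "c"}
print(is_core({"a", "b"}, alpha, DB, tau=0.5))  # ratio 2/3
print(is_core({"a"}, alpha, DB, tau=0.5))       # ratio 2/5
```

A subpattern that occurs in many transactions where α does not occur has a low ratio and is not a core pattern of α.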
Example: Core Patterns
A colossal pattern has far more core patterns than a small-sized
pattern
A colossal pattern has far more core descendants of a smaller size c
A random draw from the complete set of patterns of size c is more
likely to pick a core descendant of a colossal pattern
A colossal pattern can be generated by merging a set of core patterns
Transaction (# of Ts)   Core Patterns (τ = 0.5)
(abe) (100)             (abe), (ab), (be), (ae), (e)
(bcf) (100)             (bcf), (bc), (bf)
(acf) (100)             (acf), (ac), (af)
(abcef) (100)           (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e),
                        (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce),
                        (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe),
                        (abcef)
Colossal Patterns Correspond to Dense Balls
Idea of Pattern-Fusion Algorithm
Experimental Setting
Synthetic data set
Diag_n: an n x (n-1) table where the ith row has the integers from 1 to n
except i. Each row is taken as an itemset. min_support is n/2.
Real data set
Replace: A program trace data set collected from the
“replace” program, widely used in software engineering
research
ALL: A popular gene expression data set, a clinical data on
ALL-AML leukemia (www.broad.mit.edu/tools/data.html).
Each item is a column, representing the activity level of
gene/protein in the sample
Frequent pattern would reveal important correlation
between gene expression patterns and disease outcomes
Experiment Results on Diag_n
LCM run time increases
exponentially with pattern
size n
Pattern-Fusion finishes
efficiently
The approximation error
of Pattern-Fusion (with
min-sup 20, in
comparison with the
complete set) is rather
close to that of uniform sampling
(which randomly picks K
patterns from the
complete answer set)
Experimental Results on ALL
ALL: A popular gene expression data set with 38
transactions, each with 866 columns
There are 1736 items in total
Experimental Results on REPLACE
REPLACE
A program trace data set, recording 4395
Summary
Mining Compressed Patterns: δ-
clustering
Why compressed patterns? ID Item-Sets Support
Summary
How to Understand and Interpret Patterns?
Semantic
Information
Not all frequent patterns are useful, only meaningful
ones …
Non-semantic info.
Definitions indicating
semantics
Examples of Usage
Synonyms
Related Words
Semantic Analysis with Context
Models
Summary
Summary
Roadmap: Many aspects & extensions on pattern
mining
Mining patterns in multi-level, multi-dimensional space
Mining rare and negative patterns
Constraint-based pattern mining
Specialized methods for mining high-dimensional data
and colossal patterns
Mining compressed or approximate patterns
Pattern exploration and understanding: Semantic
annotation of frequent patterns
Ref: Mining Multi-Level and Quantitative
Rules
Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association
Rules, KDD'99
T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using
two-dimensional optimized association rules: Scheme, algorithms, and
visualization. SIGMOD'96.
J. Han and Y. Fu. Discovery of multiple-level association rules from large
databases. VLDB'95.
R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
R. Srikant and R. Agrawal. Mining quantitative association rules in large
relational tables. SIGMOD'96.
K. Wang, Y. He, and J. Han. Mining frequent itemsets using support
constraints. VLDB'00
K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing
optimized rectilinear regions for association rules. KDD'97.
Ref: Mining Other Kinds of Rules
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new
paradigm for fast, quantifiable data mining. VLDB'98
Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen. Efficient Discovery of
Functional and Approximate Dependencies Using Partitions. ICDE’98.
H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern
Extraction with Fascicles. VLDB'99
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association
rules. VLDB'96.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative
associations in a large database of customer transactions. ICDE'98.
D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov.
Query flocks: A generalization of association-rule mining. SIGMOD'98.
Ref: Constraint-Based Pattern Mining
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item
constraints. KDD'97
R. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning
optimizations of constrained association rules. SIGMOD’98
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained
correlated sets. ICDE'00
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with
Convertible Constraints. ICDE'01
J. Pei, J. Han, and W. Wang, Mining Sequential Patterns with Constraints in
Large Databases, CIKM'02
F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. ExAnte: Anticipated
Data Reduction in Constrained Pattern Mining, PKDD'03
F. Zhu, X. Yan, J. Han, and P. S. Yu, “gPrune: A Constraint Pushing
Framework for Graph Pattern Mining”, PAKDD'07
Ref: Mining Sequential Patterns
X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns with
gap constraints. ICDM'05
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. DAMI:97.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. EDBT’96.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large
Datasets. SDM'03.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine
Learning:01.
Ref: Mining Graph and Structured
Patterns
A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for
mining frequent substructures from graph data. PKDD'00
M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining.
ICDM'02
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns.
KDD'03
X. Yan, P. S. Yu, and J. Han. Graph indexing based on discriminative frequent
structure analysis. ACM TODS, 30:960–993, 2005
X. Yan, F. Zhu, P. S. Yu, and J. Han. Feature-based substructure similarity
search. ACM Trans. Database Systems, 31:1418–1453, 2006
Ref: Mining Spatial, Spatiotemporal, Multimedia
Data
H. Cao, N. Mamoulis, and D. W. Cheung. Mining frequent spatiotemporal
sequential patterns. ICDM'05
D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns.
SSTD'01
K. Koperski and J. Han, Discovery of Spatial Association Rules in Geographic
Information Databases, SSD’95
H. Xiong, S. Shekhar, Y. Huang, V. Kumar, X. Ma, and J. S. Yoo. A framework
for discovering co-location patterns in data sets with extended spatial
objects. SDM'04
J. Yuan, Y. Wu, and M. Yang. Discovery of collocation patterns: From visual
words to visual phrases. CVPR'07
O. R. Zaiane, J. Han, and H. Zhu, Mining Recurrent Items in Multimedia with
Progressive Resolution Refinement. ICDE'00
Ref: Mining Frequent Patterns in Time-Series
Data
B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series
Database, ICDE'99.
J. Shieh and E. Keogh. iSAX: Indexing and mining terabyte sized time series. KDD'08
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online
Data Mining for Co-Evolving Time Sequences. ICDE'00.
W. Wang, J. Yang, R. Muntz. TAR: Temporal Association Rules on Evolving Numerical
Attributes. ICDE’01.
J. Yang, W. Wang, P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data.
TKDE’03
L. Ye and E. Keogh. Time series shapelets: A new primitive for data mining. KDD'09
Ref: FP for Classification and
Clustering
G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. KDD'99.
B. Liu, W. Hsu, Y. Ma. Integrating Classification and Association Rule
Mining. KDD’98.
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based
on Multiple Class-Association Rules. ICDM'01.
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in
large data sets. SIGMOD’ 02.
J. Yang and W. Wang. CLUSEQ: efficient and effective sequence
clustering. ICDE’03.
X. Yin and J. Han. CPAR: Classification based on Predictive Association
Rules. SDM'03.
H. Cheng, X. Yan, J. Han, and C.-W. Hsu. Discriminative Frequent Pattern
Analysis for Effective Classification. ICDE'07
Ref: Privacy-Preserving FP Mining
Ref: Mining Compressed Patterns
D. Xin, H. Cheng, X. Yan, and J. Han. Extracting redundancy-
aware top-k patterns. KDD'06
D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed
frequent-pattern sets. VLDB'05
X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset
patterns: A profile-based approach. KDD'05
Ref: Mining Colossal Patterns
F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng. Mining colossal
frequent patterns by core pattern fusion. ICDE'07
F. Zhu, Q. Qu, D. Lo, X. Yan, J. Han, and P. S. Yu. Mining Top-K Large
Structural Patterns in a Massive Network. VLDB’11
Ref: FP Mining from Data Streams
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB'02.
R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for
finding frequent elements in streams and bags. TODS 2003.
G. Manku and R. Motwani. Approximate Frequency Counts over Data
Streams. VLDB’02.
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of
frequent and top-k elements in data streams. ICDT'05
Ref: Freq. Pattern Mining Applications
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or
How to Build a Data Quality Browser. SIGMOD'02
M. Khan, H. Le, H. Ahmadi, T. Abdelzaher, and J. Han. DustMiner: Troubleshooting
interactive complexity bugs in sensor networks., SenSys'08
Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related
bugs in operating system code. In Proc. 2004 Symp. Operating Systems Design and
Implementation (OSDI'04)
Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and
detecting violations in large software code. FSE'05
D. Lo, H. Cheng, J. Han, S. Khoo, and C. Sun. Classification of software behaviors for failure
detection: A discriminative pattern mining approach. KDD'09
Q. Mei, D. Xin, H. Cheng, J. Han, and C. Zhai. Semantic annotation of frequent patterns.
ACM TKDD, 2007.
K. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT’02.