Mining Frequent Patterns and Associations
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
What Is Frequent Pattern Analysis?
■ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
■ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
■ Motivation: Finding inherent regularities in data
■ What products were often purchased together? Milk and bread?
■ What are the subsequent purchases after buying a PC?
■ What kinds of DNA are sensitive to this new drug?
■ Can we automatically classify web documents?
■ Applications
■ Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
■ Broad applications
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Definitions
■ A set of items is referred to as an itemset.
■ An itemset that contains k items is known as a k-itemset.
■ Support count (also called frequency, count, or absolute support):
■ The number of transactions that contain the itemset.
■ Frequent itemset:
■ An itemset whose support count satisfies a minimum support threshold (min_sup).
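As a concrete illustration of these definitions, here is a minimal Python sketch that computes support counts and frequent itemsets by brute force on a toy transaction database (the database, threshold, and names are illustrative, not from the slides):

from itertools import combinations

# Toy transaction database: each transaction is a set of items (illustrative only)
DB = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_sup = 2  # minimum support threshold (absolute support count)

def support_count(itemset, db):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in db if itemset <= t)

# Enumerate every k-itemset (brute force, fine only for tiny examples)
items = sorted(set().union(*DB))
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        cnt = support_count(set(cand), DB)
        if cnt >= min_sup:          # frequent itemset: support count >= min_sup
            frequent[cand] = cnt

print(frequent)   # e.g. ('bread',): 4, ('bread', 'milk'): 3, ('bread', 'butter', 'milk'): 2, ...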
Closed Patterns and Max-Patterns
■ A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
■ Solution: Mine closed patterns and max-patterns instead
■ An itemset X is closed if X is frequent and no proper superset of X has the same support as X
■ An itemset X is a max-pattern if X is frequent and no proper superset of X is frequent
Closed Patterns and Max-Patterns
■ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
■ Min_sup = 1.
■ What is the set of closed itemsets?
■ <a1, …, a100>: 1
■ <a1, …, a50>: 2
■ What is the set of max-patterns?
■ <a1, …, a100>: 1
■ What is the set of all frequent patterns?
■ All 2^100 − 1 non-empty subsets of {a1, …, a100}: far too many to enumerate!
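A brute-force Python sketch of closed and max-patterns on a tiny database shaped like the exercise (two transactions, one contained in the other); the database and names are illustrative and only meant to make the definitions concrete:

from itertools import combinations

# Tiny stand-in for the exercise DB (the <a1, ..., a100> example is far too large to enumerate)
DB = [{"a", "b", "c", "d"}, {"a", "b"}]
min_sup = 1

def support(itemset, db):
    return sum(1 for t in db if itemset <= t)

items = sorted(set().union(*DB))
freq = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        s = support(set(c), DB)
        if s >= min_sup:
            freq[frozenset(c)] = s

# Closed pattern: frequent, and no proper superset has the same support
closed = {x: s for x, s in freq.items()
          if not any(x < y and s == sy for y, sy in freq.items())}

# Max-pattern: frequent, and no proper superset is frequent at all
maximal = {x: s for x, s in freq.items() if not any(x < y for y in freq)}

print(len(freq))   # 15 frequent patterns (all non-empty subsets of {a, b, c, d})
print(closed)      # {a, b, c, d}: 1 and {a, b}: 2
print(maximal)     # {a, b, c, d}: 1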
Computational Complexity of Frequent Itemset Mining
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
Scalable Frequent Itemset Mining Methods
■ Apriori: A Candidate Generation-and-Test Approach
■ FPGrowth: A Frequent Pattern-Growth Approach
■ ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable
Mining Methods
■ The downward closure property of frequent patterns
■ Any subset of a frequent itemset must be frequent
■ Equivalently (the Apriori pruning rule): if an itemset is infrequent, none of its supersets can be frequent
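A minimal Python sketch of the pruning check this property justifies; the helper name has_infrequent_subset and the example itemsets are illustrative:

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    # candidate: a k-itemset (frozenset); frequent_prev: the frequent (k-1)-itemsets.
    # By downward closure, if any (k-1)-subset of the candidate is not frequent,
    # the candidate cannot be frequent and can be pruned without counting it.
    k = len(candidate)
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k - 1))

# {A, B} and {B, C} are frequent but {A, C} is not, so {A, B, C} is pruned
L2 = {frozenset({"A", "B"}), frozenset({"B", "C"})}
print(has_infrequent_subset(frozenset({"A", "B", "C"}), L2))   # True -> prune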
Apriori: A Candidate Generation & Test Approach
Apriori: A Candidate Generation & Test Approach
Apriori: A Candidate Generation & Test Approach
Iteration 2:
Join Step:
1. Generate candidate 2-itemsets (C2) using the join method (all possible combinations of frequent 1-items, in lexicographic order).
Prune Step:
2. Eliminate itemsets from C2 using the Apriori property, by checking each candidate's subsets against L1.
3. Find the support count of all remaining 2-itemsets (second scan of D).
4. Generate the frequent 2-itemsets (L2).
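A minimal Python sketch of these four steps on a hypothetical L1 and database D (all data and names are illustrative):

from itertools import combinations

L1 = [("bread",), ("butter",), ("milk",)]      # hypothetical frequent 1-itemsets
D = [{"milk", "bread"}, {"bread", "butter"}, {"milk", "bread", "butter"}, {"milk"}]
min_sup = 2

# Join step: all 2-item combinations of frequent 1-items, kept in lexicographic order
C2 = [tuple(sorted(a + b)) for a, b in combinations(L1, 2)]

# Prune step is trivial for k = 2 (every 1-subset is in L1 by construction),
# so go straight to the second scan of D to count support
counts = {c: sum(1 for t in D if set(c) <= t) for c in C2}

# Frequent 2-itemsets
L2 = {c: n for c, n in counts.items() if n >= min_sup}
print(L2)   # {('bread', 'butter'): 2, ('bread', 'milk'): 2}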
Apriori: A Candidate Generation & Test Approach
Apriori: A Candidate Generation & Test Approach
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
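A runnable Python sketch of the pseudocode above, read directly (a set-based join with downward-closure pruning rather than the prefix-based join of the original paper); the toy database and names are illustrative:

from itertools import combinations

def apriori(D, min_sup):
    # D: list of transactions (sets of items); min_sup: absolute support count
    items = sorted(set().union(*D))
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in D if i in t) >= min_sup}
    result = {}
    while Lk:
        for x in Lk:                                   # record Lk with its support
            result[x] = sum(1 for t in D if x <= t)
        k = len(next(iter(Lk)))
        # Candidate generation: join Lk with itself, then prune any candidate
        # that has an infrequent k-subset (downward closure)
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One scan of the database counts the surviving candidates
        Lk = {c for c in Ck1 if sum(1 for t in D if c <= t) >= min_sup}
    return result

D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(D, min_sup=2))   # every non-empty subset of {A, B, C} is frequent here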
Implementation of Apriori
Further Improvement of the Apriori Method
Partition: Scan Database Only Twice
■ Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent
patterns
■ Scan 2: consolidate global frequent patterns
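A minimal Python sketch of this two-scan scheme; the brute-force local miner stands in for whatever frequent-itemset algorithm is run per partition, and all names are illustrative:

from itertools import combinations

def local_frequent(db_part, min_sup_ratio):
    # Scan 1 helper: brute-force miner for the local frequent itemsets of one partition
    items = sorted(set().union(*db_part))
    min_cnt = max(1, int(min_sup_ratio * len(db_part)))
    out = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(1 for t in db_part if set(c) <= t) >= min_cnt:
                out.add(frozenset(c))
    return out

def partition_mine(D, min_sup_ratio, n_parts=2):
    size = (len(D) + n_parts - 1) // n_parts
    parts = [D[i:i + size] for i in range(0, len(D), size)]
    # Scan 1: anything globally frequent must be locally frequent somewhere,
    # so the union of local results is a complete candidate set
    candidates = set().union(*(local_frequent(p, min_sup_ratio) for p in parts))
    # Scan 2: count every candidate against the full database
    min_cnt = min_sup_ratio * len(D)
    result = {}
    for c in candidates:
        cnt = sum(1 for t in D if c <= t)
        if cnt >= min_cnt:
            result[c] = cnt
    return result

D = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"}]
print(partition_mine(D, min_sup_ratio=0.5))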
Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
■ Bottlenecks of the Apriori approach
■ Breadth-first (i.e., level-wise) search
■ Candidate generation and test
■ Often generates a huge number of candidates
■ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)
■ Depth-first search
■ Avoid explicit candidate generation
■ Major philosophy: Grow long patterns from short ones using local
frequent items only
■ “abc” is a frequent pattern
■ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
■ “d” is a local frequent item in DB|abc → “abcd” is a frequent pattern
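A minimal Python sketch of this grow-from-projections idea; it works on plain projected transaction lists rather than an actual FP-tree, so it is a simplification of FPGrowth, and all names are illustrative:

from collections import Counter

def pattern_growth(db, min_sup, prefix=()):
    # db: (projected) list of transactions as sets; prefix: the pattern grown so far
    results = {}
    counts = Counter(item for t in db for item in t)   # local frequent-item counts
    for item in sorted(counts):
        cnt = counts[item]
        if cnt < min_sup:
            continue
        pattern = prefix + (item,)
        results[pattern] = cnt
        # Project the database on `item`, keeping only items after it in a fixed
        # order so every pattern is generated exactly once, then grow recursively
        projected = [{i for i in t if i > item} for t in db if item in t]
        results.update(pattern_growth(projected, min_sup, pattern))
    return results

D = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "c", "d"}, {"b", "d"}]
print(pattern_growth(D, min_sup=2))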
Construct FP-tree from a Transaction Database
■ Patterns containing p
■ …
■ Pattern f
Find Patterns Having P From P-conditional Database
[Figure: FP-tree for the example database, with header table (item: frequency) f:4, c:4, a:3, b:3, m:3, p:3 and node-links from each header entry into the tree.]

Conditional pattern bases
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
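A compact Python sketch of FP-tree construction and conditional-pattern-base extraction. The transaction list below is the ordered frequent-item projection behind the figure above (f-list f-c-a-b-m-p, min_sup = 3); class and function names are illustrative:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(ordered_transactions):
    # Transactions must already be restricted to frequent items and
    # sorted in f-list (descending frequency) order
    root = FPNode(None, None)
    header = defaultdict(list)          # item -> node-links into the tree
    for trans in ordered_transactions:
        node = root
        for item in trans:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(item, header):
    # Follow the node-links of `item`; each prefix path, weighted by the
    # node's count, is one entry of the conditional pattern base
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

transactions = [
    ["f", "c", "a", "m", "p"],
    ["f", "c", "a", "b", "m"],
    ["f", "b"],
    ["c", "b", "p"],
    ["f", "c", "a", "m", "p"],
]
root, header = build_fp_tree(transactions)
for item in ["c", "a", "b", "m", "p"]:
    print(item, conditional_pattern_base(item, header))
# e.g. p -> [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)], i.e. fcam:2, cb:1 as in the table above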
From Conditional Pattern-bases to Conditional FP-trees
[Figure: the m-conditional pattern base (fca:2, fcab:1) yields the m-conditional FP-tree {} → f:3 → c:3 → a:3. Mining it recursively: cond. pattern base of “am”: (fc:3) → am-conditional FP-tree {} → f:3 → c:3; cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree {} → f:3; cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree {} → f:3.]
Benefits of the FP-tree Structure
■ Completeness
■ Preserve complete information for frequent pattern
mining
■ Never break a long pattern of any transaction
■ Compactness
■ Reduce irrelevant info—infrequent items are gone
■ Items in frequency descending order: the more
frequently occurring, the more likely to be shared
■ Never larger than the original database (not counting node-links and count fields)
The Frequent Pattern Growth Mining Method
■ Idea: recursively grow frequent patterns by pattern and database partition
■ Method
■ For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
■ Repeat the process on each newly created conditional FP-tree
■ Until the resulting FP-tree is empty, or it contains only a single path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)
Scaling FP-growth by Database Projection
Advantages of the Pattern Growth Approach
■ Divide-and-conquer:
■ Decompose both the mining task and DB according to the
frequent patterns obtained so far
■ Lead to focused search of smaller databases
■ Other factors
■ No candidate generation, no candidate test
■ Compressed database: FP-tree structure
■ No repeated scan of entire database
■ Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
■ A good open-source implementation and refinement of FPGrowth
■ FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Further Improvements of Mining Methods
Extension of Pattern Growth Mining Methodology
ECLAT: Mining by Exploring Vertical Data Format
■ Vertical format: t(AB) = {T11, T25, …}
■ tid-list: list of trans.-ids containing an itemset
■ Deriving frequent patterns based on vertical intersections
■ t(X) = t(Y): X and Y always happen together
■ t(X) ⊂ t(Y): transaction having X always has Y
■ Using diffset to accelerate mining
■ Only keep track of differences of tids
■ t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
■ Diffset (XY, X) = {T2}
■ Eclat (Zaki et al. @KDD’97)
■ Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
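A minimal Python sketch of Eclat-style mining by tid-list intersection (diffsets are only illustrated in the final comment); the toy database and names are illustrative:

from collections import defaultdict

def eclat(tidlists, min_sup, prefix=()):
    # tidlists: dict mapping each candidate extension item to its tid-set
    results = {}
    items = sorted(tidlists)
    for i, x in enumerate(items):
        tids_x = tidlists[x]
        if len(tids_x) < min_sup:
            continue
        pattern = prefix + (x,)
        results[pattern] = len(tids_x)
        # Intersect tid-lists to extend the pattern with items after x
        suffix = {}
        for y in items[i + 1:]:
            tids_xy = tids_x & tidlists[y]
            if len(tids_xy) >= min_sup:
                suffix[y] = tids_xy
        results.update(eclat(suffix, min_sup, pattern))
    return results

# Build the vertical format (item -> tid-list) from a horizontal database
D = {1: {"A", "B", "C"}, 2: {"A", "B"}, 3: {"A", "C"}, 4: {"B", "C"}}
vertical = defaultdict(set)
for tid, itemset in D.items():
    for item in itemset:
        vertical[item].add(tid)

print(eclat(dict(vertical), min_sup=2))
# Diffset idea: t(A) = {1, 2, 3} and t(AB) = {1, 2}, so diffset(AB, A) = {3}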
Scalable Frequent Itemset Mining Methods
■ Apriori: A Candidate Generation-and-Test Approach
■ FPGrowth: A Frequent Pattern-Growth Approach
■ ECLAT: Frequent Pattern Mining with Vertical Data Format
Visualization of Association Rules
(SGI/MineSet 3.0)
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
Interestingness Measure: Correlations (Lift)
■ play basketball ⇒ eat cereal [40%, 66.7%] is misleading
■ The overall % of students eating cereal is 75% > 66.7%.
■ play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence
■ Measure of dependent/correlated events: lift
■ lift(A, B) = P(A ∪ B) / (P(A) P(B)), where P(A ∪ B) is the probability that a transaction contains both A and B; lift > 1: positively correlated, lift < 1: negatively correlated, lift = 1: independent
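Working this out from the figures quoted above (rule support 40%, confidence 66.7%, hence P(basketball) = 0.40 / 0.667 ≈ 0.60, and P(cereal) = 0.75):

lift(basketball, cereal) = P(basketball ∪ cereal) / (P(basketball) × P(cereal)) = 0.40 / (0.60 × 0.75) ≈ 0.89 < 1

so playing basketball and eating cereal are negatively correlated, which is why the rule above is misleading despite its 66.7% confidence.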
Chapter 5: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Which Patterns Are Interesting? Pattern Evaluation Methods
■ Summary
Summary
■ Basic concepts: association rules, support-confidence framework, closed and max-patterns
■ Scalable frequent pattern mining methods
■ Apriori (Candidate generation & test)
■ Projection-based (FPgrowth, CLOSET+, ...)
■ Vertical format approach (ECLAT, CHARM, ...)
■ Which patterns are interesting?
■ Pattern evaluation methods
Ref: Basic Concepts of Frequent Pattern Mining
■ (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. SIGMOD'93
■ (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from
databases. SIGMOD'98
■ (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
Discovering frequent closed itemsets for association rules. ICDT'99
■ (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns.
ICDE'95
Ref: Apriori and Its Improvements
■ R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94
■ H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
■ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
■ J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
■ H. Toivonen. Sampling large databases for association rules. VLDB'96
■ S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
■ S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
Ref: Depth-First, Projection-Based FP Mining
■ R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
■ G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
■ B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
■ J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00
■ J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
■ J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
■ J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
Ref: Vertical Format and Row Enumeration Methods
Ref: Mining Correlations and Interesting Rules
■ S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
■ M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
■ R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
■ C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
■ P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
■ E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
■ T. Wu, Y. Chen, and J. Han. Re-examination of interestingness measures in pattern mining: A unified framework. Data Mining and Knowledge Discovery, 21(3):371-397, 2010