P8 FPBasic
Slides adapted from Jiawei Han, Computer Science, Univ. of Illinois at Urbana-Champaign, 2017
1
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
2
Pattern Discovery: Basic Concepts
3
What Is Pattern Discovery?
q What are patterns?
q Patterns: A set of items, subsequences, or substructures that occur
frequently together (or strongly correlated) in a data set
q Patterns represent intrinsic and important properties of datasets
q Pattern discovery: Uncovering patterns from massive data sets
q Motivation examples:
q What products were often purchased together?
q What are the subsequent purchases after buying an iPad?
q What code segments likely contain copy-and-paste bugs?
q What word sequences likely form phrases in this corpus?
4
Pattern Discovery: Why Is It Important?
q Finding inherent regularities in a data set
q Foundation for many essential data mining tasks
q Association, correlation, and causality analysis
q Mining sequential, structural (e.g., sub-graph) patterns
q Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
q Classification: Discriminative pattern-based analysis
q Cluster analysis: Pattern-based subspace clustering
q Broad applications
q Market basket analysis, cross-marketing, catalog design, sales campaign
analysis, Web log analysis, biological sequence analysis
5
Basic Concepts: k-Itemsets and Their Supports
q Itemset: A set of one or more items
q k-itemset: X = {x1, …, xk}
q Ex. {Beer, Nuts, Diaper} is a 3-itemset
q (absolute) support (count) of X, sup{X}: Frequency or the number of occurrences of an itemset X
q Ex. sup{Beer} = 3
q Ex. sup{Diaper} = 4
q Ex. sup{Beer, Diaper} = 3
q Ex. sup{Beer, Eggs} = 1
q (relative) support, s{X}: The fraction of transactions that contain X (i.e., the probability that a transaction contains X)
q Ex. s{Beer} = 3/5 = 60%
q Ex. s{Diaper} = 4/5 = 80%
q Ex. s{Beer, Eggs} = 1/5 = 20%

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
6
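To make the support definitions concrete, here is a minimal Python sketch (an illustration, not part of the original slides; the helper names abs_support and rel_support are assumptions) that recomputes the examples above from the five-transaction DB:

# Minimal sketch: absolute and relative support of an itemset,
# using the 5-transaction example DB from this slide.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def abs_support(itemset, db):
    """Absolute support: number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

def rel_support(itemset, db):
    """Relative support: fraction of transactions containing `itemset`."""
    return abs_support(itemset, db) / len(db)

print(abs_support({"Beer", "Diaper"}, transactions))   # 3
print(rel_support({"Beer", "Eggs"}, transactions))      # 0.2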
Basic Concepts: Frequent Itemsets (Patterns)
q An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ
q Ex. For the transaction DB above, let minsup = 50% (i.e., an absolute support of 3):
q Frequent 1-itemsets: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3
q Frequent 2-itemsets: {Beer, Diaper}:3
7
From Frequent Itemsets to Association Rules
q Comparing with itemsets, rules can be more telling
q Ex. Diaper → Beer
q Buying diapers may likely lead to buying beers
q How strong is this rule? (support, confidence)
q Measuring association rules: X → Y (s, c)
q Both X and Y are itemsets
q Support, s: The probability that a transaction contains X ∪ Y
q Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
q Confidence, c: The conditional probability that a transaction containing X also contains Y, i.e., c = sup{X ∪ Y} / sup{X}
q Ex. c(Diaper → Beer) = sup{Diaper, Beer} / sup{Diaper} = 3/4 = 75%

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Venn diagram: transactions containing beer, transactions containing diaper, and their overlap containing both, i.e., {Beer} ∪ {Diaper} = {Beer, Diaper})
10
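A small follow-up sketch, reusing the transactions list and the abs_support/rel_support helpers from the previous sketch, computes (s, c) for a rule X → Y; the helper name rule_strength is an assumption:

# Minimal sketch: support and confidence of a rule X -> Y,
# reusing `transactions`, abs_support and rel_support defined above.
def rule_strength(X, Y, db):
    """Return (support, confidence) for the rule X -> Y."""
    s = rel_support(set(X) | set(Y), db)                       # s = P(X ∪ Y)
    c = abs_support(set(X) | set(Y), db) / abs_support(X, db)  # c = sup(X ∪ Y) / sup(X)
    return s, c

print(rule_strength({"Diaper"}, {"Beer"}, transactions))  # (0.6, 0.75)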
Expressing Patterns in Compressed Form: Closed Patterns
q How to handle the huge number of frequent patterns that mining can generate?
q Solution 1: Closed patterns: A pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y ⊃ X with the same support as X
q Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
q Suppose minsup = 1. How many closed patterns does TDB1 contain?
q Two: P1: “{a1, …, a50}: 2”; P2: “{a1, …, a100}: 1”
q Closed patterns are a lossless compression of frequent patterns
q Reduces the # of patterns but does not lose the support information!
q You will still be able to say: “{a2, …, a40}: 2”, “{a5, a51}: 1”
11
Expressing Patterns in Compressed Form: Max-Patterns
q Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
q Difference from closed patterns? We do not care about the real support of the sub-patterns of a max-pattern
q Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
q Suppose minsup = 1. How many max-patterns does TDB1 contain?
q One: P: “{a1, …, a100}: 1”
q Max-patterns are a lossy compression!
q We only know {a1, …, a40} is frequent
q But we no longer know the real support of {a1, …, a40}, …
q Thus in many applications, mining closed patterns is more desirable than mining max-patterns
12
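Both compressed forms can be checked mechanically. Here is a minimal Python sketch, assuming the frequent itemsets and their supports have already been mined into a dictionary (the TDB1 example above is too large to enumerate, so a two-transaction toy DB is used instead):

# Minimal sketch: identifying closed and max patterns among a given
# set of frequent itemsets (mapping frozenset -> absolute support).
def closed_patterns(freq):
    # Closed: no proper super-pattern with the same support
    return {X: s for X, s in freq.items()
            if not any(X < Y and s == sY for Y, sY in freq.items())}

def max_patterns(freq):
    # Max: no proper super-pattern that is frequent at all
    return {X: s for X, s in freq.items()
            if not any(X < Y for Y in freq)}

# Toy example (minsup = 1): frequent itemsets of TDB = {T1: {a, b}, T2: {a, b, c}}
freq = {
    frozenset("a"): 2, frozenset("b"): 2, frozenset("ab"): 2,
    frozenset("c"): 1, frozenset("ac"): 1, frozenset("bc"): 1, frozenset("abc"): 1,
}
print(closed_patterns(freq))  # {a, b}: 2 and {a, b, c}: 1
print(max_patterns(freq))     # {a, b, c}: 1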
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
13
Efficient Pattern Mining Methods
q The Downward Closure Property of Frequent Patterns
14
The Downward Closure Property of Frequent Patterns
q Observation: From TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
q We get a frequent itemset: {a1, …, a50}
q Also, its subsets are all frequent: {a1}, {a2}, …, {a50}, {a1, a2}, …, {a1, …, a49}, …
q There must be some hidden relationships among frequent patterns!
q The downward closure (also called “Apriori”) property of frequent patterns
q If {beer, diaper, nuts} is frequent, so is {beer, diaper}
q Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
q Apriori: Any subset of a frequent itemset must be frequent
q Efficient mining methodology
q If any subset of an itemset S is infrequent, then there is no chance for S to
be frequent—why do we even have to consider S!? A sharp knife for pruning!
15
Apriori Pruning and Scalable Mining Methods
q Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not even be generated! (Agrawal &
Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
q Scalable mining Methods: Three major approaches
q Level-wise, join-based approach: Apriori (Agrawal &
Srikant@VLDB’94)
q Vertical data format approach: Eclat (Zaki, Parthasarathy,
Ogihara, Li @KDD’97)
q Frequent pattern projection and growth: FPgrowth (Han, Pei,
Yin @SIGMOD’00)
16
Apriori: A Candidate Generation & Test Approach
q Outline of Apriori (level-wise, candidate generation and test)
q Initially, scan DB once to get frequent 1-itemset
q Repeat
q Generate length-(k+1) candidate itemsets from length-k frequent
itemsets
q Test the candidates against DB to find frequent (k+1)-itemsets
q Set k := k +1
q Until no frequent or candidate set can be generated
q Return all the frequent itemsets derived
17
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemsets of size k
Fk: Frequent itemsets of size k

k := 1;
Fk := {frequent items}; // frequent 1-itemsets
While (Fk ≠ ∅) do { // while Fk is non-empty
    Ck+1 := candidates generated from Fk; // candidate generation
    Derive Fk+1 by counting candidates in Ck+1 with respect to TDB at minsup;
    k := k + 1
}
return ∪k Fk // return the Fk generated at each level
18
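A minimal Python sketch of this loop, assuming transactions are given as sets of items. Candidate generation here extends each frequent k-itemset with a frequent item and keeps only extensions whose k-subsets are all frequent, which is equivalent to the self-join-plus-pruning step described on the following slides:

# Minimal Apriori sketch following the pseudo-code above.
# `db` is a list of item sets; returns {frozenset: absolute support}.
from itertools import combinations

def apriori(db, minsup):
    db = [frozenset(t) for t in db]
    # F1: frequent 1-itemsets
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {X: c for X, c in counts.items() if c >= minsup}
    freq, k = {}, 1
    while Fk:
        freq.update(Fk)
        # Generate length-(k+1) candidates whose k-subsets are all frequent (Apriori pruning)
        items = sorted({i for X in Fk for i in X})
        Ck1 = {X | {i} for X in Fk for i in items
               if i not in X and all(frozenset(s) in Fk for s in combinations(X | {i}, k))}
        # Scan DB to count the candidates; keep those reaching minsup
        Fk = {c: n for c in Ck1 if (n := sum(1 for t in db if c <= t)) >= minsup}
        k += 1
    return freq

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, minsup=2))   # includes frozenset({'B', 'C', 'E'}): 2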
The Apriori Algorithm—An Example (minsup = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 with counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
F1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from F1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 with counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
F2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (generated from F2): {B, C, E}
3rd scan → F3: {B, C, E}:2
19
Apriori: Implementation Tricks
q How to generate candidates?
q Step 1: self-joining Fk
q Step 2: pruning
q Example of candidate generation
q F3 = {abc, abd, acd, ace, bcd}
q Self-joining: F3*F3
q abcd from abc and abd
q acde from acd and ace
q Pruning:
q acde is removed because ade is not in F3
q C4 = {abcd}
(Diagram: self-joining abc, abd, acd, ace, bcd yields abcd and acde; acde is pruned)
20
Candidate Generation: An SQL Implementation
q Suppose the items in Fk-1 are listed in an order
q Step 1: self-joining Fk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Fk-1 as p, Fk-1 as q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
q Step 2: pruning
    for all itemsets c in Ck do
        for all (k-1)-subsets s of c do
            if (s is not in Fk-1) then delete c from Ck
(Diagram: the same self-join/pruning example as on the previous slide)
21
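The same two steps, sketched in Python rather than SQL, with itemsets kept as lexicographically ordered tuples (the function name gen_candidates is an assumption); the example reproduces the F3 case from the previous slide:

# Minimal sketch of the two steps: a prefix self-join on ordered itemsets,
# then Apriori pruning. Mirrors the SQL statement above.
from itertools import combinations

def gen_candidates(Fk_minus_1):
    """Fk_minus_1: set of sorted tuples of items; returns the candidate k-itemsets."""
    Fset = set(Fk_minus_1)
    k_minus_1 = len(next(iter(Fk_minus_1)))
    Ck = set()
    # Step 1: self-join on the first k-2 items
    for p in Fk_minus_1:
        for q in Fk_minus_1:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # Step 2: prune candidates having an infrequent (k-1)-subset
    return {c for c in Ck
            if all(s in Fset for s in combinations(c, k_minus_1))}

F3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(gen_candidates(F3))  # {('a', 'b', 'c', 'd')} -- ('a','c','d','e') is pruned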
Apriori: Improvements and Alternatives
q Reduce passes of transaction database scans (discussed in subsequent slides)
q Partitioning (e.g., Savasere, et al., 1995)
q Dynamic itemset counting (Brin, et al., 1997)
q Shrink the number of candidates (discussed in subsequent slides)
q Hashing (e.g., DHP: Park, et al., 1995)
q Pruning by support lower bounding (e.g., Bayardo 1998)
q Sampling (e.g., Toivonen, 1996)
q Exploring special data structures
q Tree projection (Agarwal, et al., 2001)
q H-Mine (Pei, et al., 2001)
q Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
22
Partitioning: Scan Database Only Twice
q Theorem: Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
q Proof sketch: if an itemset were infrequent in every partition, its count in each partition would fall below minsup × |partition|, so its total count would fall below minsup × |TDB|
q Method: Scan 1: partition the database and find the local frequent itemsets of each partition; Scan 2: count the union of all local frequent itemsets against the whole DB to determine which are globally frequent
q Related hashing trick (DHP): hash the 2-itemsets of each transaction (e.g., {bd}, {be}, {de}) into buckets; a 2-itemset can be frequent only if its bucket count reaches minsup
24
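A hedged Python sketch of the two-scan idea (after Savasere et al.), reusing the apriori() sketch from the earlier slide as the in-memory miner for each partition; the function name and the choice of two partitions are assumptions:

# Sketch of partition-based mining: any globally frequent itemset must be
# locally frequent in at least one partition, so the union of the local
# results is a complete candidate set. Reuses apriori() defined above.
from math import ceil

def partition_mining(db, minsup_frac, num_parts=2):
    """Two database scans: mine each partition locally, then verify globally."""
    db = [frozenset(t) for t in db]
    n = len(db)
    size = ceil(n / num_parts)
    # Scan 1: mine each partition with the same *relative* minsup
    candidates = set()
    for i in range(0, n, size):
        part = db[i:i + size]
        candidates |= set(apriori(part, ceil(minsup_frac * len(part))))
    # Scan 2: count every candidate against the full DB and keep the truly frequent ones
    counts = {c: sum(1 for t in db if c <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= minsup_frac * n}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mining(db, minsup_frac=0.5))   # same result as plain Apriori at minsup = 2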
Exploring Vertical Data Format: ECLAT
q ECLAT (Equivalence Class Transformation): A depth-first search algorithm using set intersection [Zaki et al. @KDD'97]
q Tid-List: List of transaction-ids containing an itemset
q Vertical format: t(e) = {T10, T20, T30}; t(a) = {T10, T20}; t(ae) = {T10, T20}
q Properties of Tid-Lists
q t(X) = t(Y): X and Y always happen together (e.g., t(ac) = t(d))
q t(X) ⊂ t(Y): a transaction having X always has Y (e.g., t(ac) ⊂ t(ce))
q Deriving frequent patterns based on vertical intersections
q Using diffsets to accelerate mining
q Only keep track of differences of tids
q t(e) = {T10, T20, T30}, t(ce) = {T10, T30} → Diffset(ce, e) = {T20}

A transaction DB in Horizontal Data Format
Tid   Itemset
10    a, c, d, e
20    a, b, e
30    b, c, e

The same DB in Vertical Data Format
Item   TidList
a      10, 20
b      20, 30
c      10, 30
d      10
e      10, 20, 30
25
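A minimal Python sketch of the vertical representation, reproducing the tid-list and diffset examples above (the helper name tidlist is an assumption):

# Minimal sketch: tid-lists, itemset support by tid-list intersection, and a
# diffset, using the 3-transaction DB shown above.
db = {10: {"a", "c", "d", "e"}, 20: {"a", "b", "e"}, 30: {"b", "c", "e"}}

def tidlist(itemset, db):
    """Tid-list of an itemset: ids of the transactions containing all its items."""
    return {tid for tid, t in db.items() if set(itemset) <= t}

t_e  = tidlist({"e"}, db)       # {10, 20, 30}
t_ce = tidlist({"c", "e"}, db)  # {10, 30}
print(len(t_ce))                # support of {c, e} = |t(ce)| = 2
print(t_e - t_ce)               # diffset(ce, e) = {20}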
Why Mining Frequent Patterns by Pattern Growth?
q Apriori: A breadth-first search mining algorithm
q First find the complete set of frequent k-itemsets
q Then derive frequent (k+1)-itemset candidates
q Scan DB again to find true frequent (k+1)-itemsets
q Motivation for a different mining methodology
q Can we develop a depth-first search mining algorithm?
q For a frequent itemset ρ, can subsequent search be confined
to only those transactions that contain ρ?
q Such thinking leads to a frequent pattern growth approach:
q FPGrowth (J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns
without Candidate Generation,” SIGMOD 2000)
26
Example: Construct FP-tree from a Transaction DB

TID   Items in the Transaction       Ordered, frequent itemlist
100   {f, a, c, d, g, i, m, p}       f, c, a, m, p
200   {a, b, c, f, l, m, o}          f, c, a, b, m
300   {b, f, h, j, o, w}             f, b
400   {b, c, k, s, p}                c, b, p
500   {a, f, c, e, l, p, m, n}       f, c, a, m, p

Let min_support = 3
1. Scan DB once, find the single-item frequent patterns: f:4, a:3, c:4, b:3, m:3, p:3
2. Sort frequent items in frequency-descending order, giving the f-list: F-list = f-c-a-b-m-p
3. Scan DB again and construct the FP-tree
q The frequent itemlist of each transaction is inserted as a branch, with shared sub-branches merged and counts accumulated

Header Table (item : frequency, each with a head of node-links): f:4, c:4, a:3, b:3, m:3, p:3

After inserting the 1st frequent itemlist “f, c, a, m, p”, the tree is the single path:
{} → f:1 → c:1 → a:1 → m:1 → p:1
27
Example: Construct FP-tree from a Transaction DB (continued)
(Same transaction DB, min_support = 3, F-list = f-c-a-b-m-p, and construction steps as on the previous slide)

After inserting the 2nd frequent itemlist “f, c, a, b, m”, the shared prefix f, c, a is merged and its counts accumulated:
{} → f:2 → c:2 → a:2, which then branches into m:1 → p:1 and b:1 → m:1
28
Example: Construct FP-tree from a Transaction DB (continued)
(Same transaction DB, min_support = 3, F-list = f-c-a-b-m-p, and construction steps as on the previous slide)

After inserting all the frequent itemlists, the complete FP-tree (header table: f:4, c:4, a:3, b:3, m:3, p:3) is:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
29
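A hedged Python sketch of the three construction steps above, not the original implementation: node fields and the header-table layout are assumptions, and ties among equally frequent items are broken alphabetically here, so the resulting f-list and tree may order tied items (e.g., f and c) differently from the slides' F-list f-c-a-b-m-p:

# Sketch: build an FP-tree and header table from a transaction DB.
from collections import defaultdict

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                      # item -> child FPNode

def build_fptree(db, min_support):
    # Step 1: scan DB once, count single items and keep the frequent ones
    freq = defaultdict(int)
    for t in db:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # Step 2: the f-list in frequency-descending order (ties broken alphabetically)
    flist = sorted(freq, key=lambda i: (-freq[i], i))
    # Step 3: scan DB again; insert each ordered frequent itemlist as a branch,
    # merging shared prefixes, accumulating counts, and recording node links
    root, header = FPNode(), defaultdict(list)
    for t in db:
        node = root
        for item in (i for i in flist if i in t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, min_support=3)
print(flist)                      # ['c', 'f', 'a', 'b', 'm', 'p'] (f and c tie at 4; the slides order them f-c)
print(root.children["c"].count)   # 4: the top node of the main branch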
Mining FP-Tree: Divide and Conquer Based on Patterns and Data
q Pattern mining can be partitioned according to the current patterns (min_support = 3)
q Patterns containing p: mined from p’s conditional database: fcam:2, cb:1
q p’s conditional database (i.e., the database under the condition that p exists): the transformed prefix paths of item p in the FP-tree
q Patterns having m but no p: mined from m’s conditional database: fca:2, fcab:1
q …… and so on: one conditional database for each remaining item
32
FPGrowth: Mining Frequent Patterns by Pattern Growth
q Essence of frequent pattern growth (FPGrowth) methodology
q Find frequent single items and partition the database based on each
such single item pattern
q Recursively grow frequent patterns by doing the above for each
partitioned database (also called the pattern’s conditional database)
q To facilitate efficient processing, an efficient data structure, FP-tree, can
be constructed
q Mining becomes
q Recursively construct and mine (conditional) FP-trees
q Until the resulting FP-tree is empty, or until it contains only one path—
single path will generate all the combinations of its sub-paths, each of
which is a frequent pattern
33
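To illustrate the recursion itself, here is a hedged Python sketch that runs pattern growth directly on conditional databases represented as (ordered itemlist, count) pairs; the FP-tree of the earlier slides is a compressed structure for exactly this computation, and the function name pattern_growth is an assumption:

# Sketch of the pattern-growth recursion on explicit conditional databases.
from collections import defaultdict

def pattern_growth(cond_db, min_support, prefix=frozenset(), out=None):
    out = {} if out is None else out
    # Frequent single items within this conditional database
    counts = defaultdict(int)
    for itemlist, cnt in cond_db:
        for item in itemlist:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < min_support:
            continue
        pattern = prefix | {item}
        out[pattern] = cnt
        # Build item's conditional database: the prefix of item in each itemlist
        new_db = []
        for itemlist, c in cond_db:
            if item in itemlist:
                pre = itemlist[:itemlist.index(item)]
                if pre:
                    new_db.append((pre, c))
        pattern_growth(new_db, min_support, pattern, out)
    return out

# Ordered frequent itemlists of the 5-transaction example (F-list order f-c-a-b-m-p)
db = [(["f","c","a","m","p"], 1), (["f","c","a","b","m"], 1), (["f","b"], 1),
      (["c","b","p"], 1), (["f","c","a","m","p"], 1)]
result = pattern_growth(db, 3)
print(result[frozenset({"f", "c", "a", "m"})])  # 3
print(result[frozenset({"c", "p"})])            # 3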
Scaling FP-growth by Item-Based Data Projection
q What if FP-tree cannot fit in memory?—Do not construct FP-tree
q “Project” the database based on frequent single items
q Construct & mine FP-tree for each projected DB
q Parallel projection vs. partition projection
q Parallel projection: Project the DB on each frequent item
q Space costly, all partitions can be processed in parallel
q Partition projection: Partition the DB in order
q Passing the unprocessed parts to subsequent partitions
(Illustration: a transaction DB shown with its parallel projection and its partition projection)

Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
36
Pattern Evaluation
q Null-Invariant Measures
37
How to Judge if a Rule/Pattern Is Interesting?
q Pattern-mining will generate a large set of patterns/rules
q Not all the generated patterns/rules are interesting
q Interestingness measures: Objective vs. subjective
q Objective interestingness measures
q Support, confidence, correlation, …
q Subjective interestingness measures:
q Different users may judge interestingness differently
(Note: Jaccard, cosine, AllConf, MaxConf, and Kulczynski are null-invariant measures)
43
Null Invariance: An Important Property
q Why is null invariance crucial for the analysis of massive transaction data?
q Many transactions may contain neither milk nor coffee! Such transactions are null-transactions w.r.t. m(ilk) and c(offee)
q Lift and χ2 are not null-invariant: not good for evaluating data that contain too many or too few null transactions!
q Many measures are not null-invariant!
(Illustration: milk vs. coffee 2 × 2 contingency table)
44
Comparison of Null-Invariant Measures
q Not all null-invariant measures are created equal: which one is better?
q Datasets D4–D6 differentiate the five null-invariant measures (all five are null-invariant on the 2-variable contingency table)
q Kulc (Kulczynski 1927) holds firm and is in balance of both directional implications
45
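For reference, a minimal Python sketch computing Lift, the five null-invariant measures, and the Imbalance Ratio from the counts of a 2 × 2 contingency table; the tiny numeric example shows that adding null transactions (increasing n) changes Lift but leaves the null-invariant measures and IR untouched:

# Sketch: interestingness measures from contingency counts.
# n_ab = #transactions with both A and B, n_a / n_b = #with A / #with B, n = total.
from math import sqrt

def measures(n_ab, n_a, n_b, n):
    s_ab, s_a, s_b = n_ab / n, n_a / n, n_b / n
    return {
        "lift":     s_ab / (s_a * s_b),            # NOT null-invariant: depends on n
        "all_conf": s_ab / max(s_a, s_b),
        "max_conf": max(s_ab / s_a, s_ab / s_b),
        "jaccard":  s_ab / (s_a + s_b - s_ab),
        "cosine":   s_ab / sqrt(s_a * s_b),
        "kulc":     0.5 * (s_ab / s_a + s_ab / s_b),
        "IR":       abs(s_a - s_b) / (s_a + s_b - s_ab),
    }

# Adding null transactions (larger n) changes only lift; the other values stay put.
print(measures(100, 1000, 1000, 10_000)["lift"])    # 1.0
print(measures(100, 1000, 1000, 100_000)["lift"])   # 10.0
print(measures(100, 1000, 1000, 100_000)["kulc"])   # 0.1 in both cases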
Analysis of DBLP Coauthor Relationships
q DBLP: Computer science research publication bibliographic database
q > 3.8 million entries on authors, papers, venues, years, and other information
q Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6
q D4 is neutral & balanced; D5 is neutral but imbalanced
q D6 is neutral but very imbalanced
47
What Measures to Choose for Effective Pattern Evaluation?
q Null value cases are predominant in many large datasets
q Neither milk nor coffee is in most of the baskets; neither Mike nor Jim is an author
in most of the papers; ……
q Null-invariance is an important property
q Lift, χ2 and cosine are good measures if null transactions are not predominant
q Otherwise, Kulczynski + Imbalance Ratio should be used to judge the
interestingness of a pattern
q Exercise: Mining research collaborations from research bibliographic data
q Find a group of frequent collaborators from research bibliographic data (e.g., DBLP)
q Can you find the likely advisor-advisee relationship and during which years such a
relationship happened?
q Ref.: C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo, "Mining Advisor-
Advisee Relationships from Research Publication Networks", KDD'10
48
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
49
Summary
q Basic Concepts
q What Is Pattern Discovery? Why Is It Important?
q Basic Concepts: Frequent Patterns and Association Rules
q Compressed Representation: Closed Patterns and Max-Patterns
q Efficient Pattern Mining Methods
q The Downward Closure Property of Frequent Patterns
q The Apriori Algorithm
q Extensions or Improvements of Apriori
q Mining Frequent Patterns by Exploring Vertical Data Format
q FPGrowth: A Frequent Pattern-Growth Approach
q Mining Closed Patterns
q Pattern Evaluation
q Interestingness Measures in Pattern Mining
q Interestingness Measures: Lift and χ2
q Null-Invariant Measures
q Comparison of Interestingness Measures
50
Recommended Readings (Basic Concepts)
q R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of
items in large databases”, in Proc. of SIGMOD'93
q R. J. Bayardo, “Efficiently mining long patterns from databases”, in Proc. of
SIGMOD'98
q N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets
for association rules”, in Proc. of ICDT'99
q J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent Pattern Mining: Current Status and
Future Directions”, Data Mining and Knowledge Discovery, 15(1): 55-86, 2007
51
Recommended Readings (Efficient Pattern Mining Methods)
q R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
q A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large
databases”, VLDB'95
q J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”,
SIGMOD'95
q S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database
systems: Alternatives and implications”, SIGMOD'98
q M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithms for discovery of association
rules”, Data Mining and Knowledge Discovery, 1997
q J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
q M. J. Zaki and C.-J. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
q J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed
Itemsets”, KDD'03
q C. C. Aggarwal, M. A. Bhuiyan, and M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in
Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
52
Recommended Readings (Pattern Evaluation)
q C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS’98
q S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97
q M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94
q E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03
q P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for
Association Patterns. KDD'02
q T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397,
2010
53