ASSOCIATION RULE MINING
Given a set of transactions, find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction
THE TASK
Two ways of defining the task
General
Input: A collection of instances
Output: rules to predict the values of any attribute(s) (not just the class attribute) from
values of other attributes
E.g. if temperature = cool then humidity = normal
If the right-hand side of a rule has only the class attribute, then the rule is a classification
rule
Distinction: Classification rules are intended to be applied together as a set of rules, whereas
association rules express independent regularities and each rule is used on its own
Specific: the market-basket model
• A large set of baskets, each of which is a small set of items, e.g., the
items one customer buys on one day
EXAMPLE
Market-Basket transactions
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence,
not causality!
{Diaper} → {Beer}: In a significant number of transactions where Diaper is present,
Beer is also present. But this does not mean that diapers cause people to buy beer.
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
(Counts refer to the five market-basket transactions above.)
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y: s(X → Y) = σ(X ∪ Y) / |T|
– Confidence (c): how often items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
– Example: for {Milk, Diaper} → {Beer}, s = 2/5 = 0.4 and c = 2/3 ≈ 0.67
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
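To make the definitions concrete, here is a minimal Python sketch (not from the slides; the function names are our own) that computes support and confidence over the five market-basket transactions above:

```python
# Toy market-basket database from the example slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(lhs -> rhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```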
Mining Association Rules
Example of Rules (computed over the five transactions above):
{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence measures
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent Itemset Generation
[Figure: the lattice of all itemsets over {A, B, C, D, E}, from null up to ABCDE. Given d items, there are 2^d possible candidate itemsets.]

• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against each of the M candidates
– Complexity ~ O(NMw): expensive, since M = 2^d
• The total number of possible rules over d items is R = 3^d − 2^(d+1) + 1 (602 rules for d = 6)
Frequent Itemset Generation Strategies
Three general strategies:
• Reduce the number of candidates (M): prune candidates using the Apriori principle
• Reduce the number of transactions (N): transaction reduction
• Reduce the number of comparisons (NM): store candidates in efficient data structures such as hash trees
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
(the support of an itemset never exceeds the support of its subsets: support is anti-monotone)
Illustrating Apriori Principle
[Figure: itemset lattice over {A, B, C, D, E}; once an itemset (e.g. AB) is found to be infrequent, all of its supersets are pruned]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

Triplets (3-itemsets):
Itemset | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are
frequent
THE APRIORI ALGORITHM: BASIC IDEA
Join Step: Ck is generated by joining Lk−1 with itself^
Prune Step: Any (k−1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
^The join of Lk with itself requires the two joining itemsets to share k−1 items.
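The pseudo-code above translates almost line for line into Python. The sketch below (our own naming; a simplified implementation, not the optimized algorithm) follows the same join / prune / count / eliminate structure and reproduces the worked example on the next slide:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, n in counts.items() if n >= min_support}
    frequent, k = set(Lk), 1
    while Lk:
        # Join step: unite pairs of frequent k-itemsets sharing k-1 items.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count step: one scan of the database per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent

# The example database below (MinSup = 2) yields {2, 3, 5} as its only 3-itemset:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(D, 2))))
```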
THE APRIORI ALGORITHM — EXAMPLE (MinSup = 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1 (candidate 1-itemsets with counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (generated from L2): {2 3 5}
Scan D → {2 3 5}: 2

L3 (frequent 3-itemsets): {2 3 5}: 2
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of each candidate
itemset
– To reduce the number of comparisons, store the candidates in a hash structure
• Instead of matching each transaction against every candidate, match it
against candidates contained in the hashed buckets
[Figure: systematic, level-by-level enumeration of the C(5,3) = 10 3-subsets of transaction {1, 2, 3, 5, 6}, as performed when probing the hash tree with a transaction]
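A simplified sketch of the same idea, assuming a flat hash map in place of a real hash tree: each transaction is expanded into its k-subsets, and only the candidates actually hit by a subset are touched, instead of matching the transaction against every candidate.

```python
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Support counting by hashing: probe each k-subset of a transaction."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # Transaction {1,2,3,5,6} has C(5,3) = 10 3-subsets to probe.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:   # only candidates in the hit bucket are updated
                counts[key] += 1
    return counts

cands = [(1, 2, 3), (1, 3, 5), (2, 3, 5), (3, 5, 6), (1, 5, 6)]
print(count_candidates([{1, 2, 3, 5, 6}], cands, 3))
```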
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with
number of transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase max length of frequent itemsets and traversals of hash tree
(the number of subsets in a transaction increases with its width)
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent

[Figure: itemset lattice over {A, B, C, D, E} with a border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately inside the border]
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the
itemset
TID | Items
1 | {A, B}
2 | {B, C, D}
3 | {A, B, C, D}
4 | {A, B, D}
5 | {A, B, C, D}

Itemset | Support
{A} | 4
{B} | 5
{C} | 3
{D} | 4
{A, B} | 4
{A, C} | 2
{A, D} | 3
{B, C} | 3
{B, D} | 4
{C, D} | 3
{A, B, C} | 2
{A, B, D} | 3
{A, C, D} | 2
{B, C, D} | 3
{A, B, C, D} | 2
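A small Python sketch (our own; minsup = 3 is an assumed threshold for the maximal part, since the slide does not fix one) that checks both definitions against the five transactions above:

```python
from itertools import combinations

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = {"A", "B", "C", "D"}

# Support count of every non-empty itemset over {A, B, C, D}.
support = {frozenset(c): sum(set(c) <= t for t in transactions)
           for k in range(1, 5) for c in combinations(sorted(items), k)}

minsup = 3  # assumed threshold
frequent = {s for s, n in support.items() if n >= minsup}

# Maximal frequent: frequent, and no immediate superset is frequent.
maximal = {s for s in frequent
           if not any(s | {i} in frequent for i in items - s)}
# Closed: no immediate superset has the same support.
closed = {s for s in support
          if not any(support[s | {i}] == support[s] for i in items - s)}

print(sorted(map(sorted, maximal)))  # [['A', 'B', 'D'], ['B', 'C', 'D']]
print(sorted(map(sorted, closed)))
```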
Maximal vs Closed Itemsets
TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

[Figure: the itemset lattice annotated with the IDs of the transactions supporting each itemset; itemsets supported by no transaction, such as ABCDE, are marked]
Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same annotated lattice, with each frequent itemset marked as either closed but not maximal, or closed and maximal]

# Closed = 9
# Maximal = 4
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as their
supersets
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
⇒ a compact representation is needed
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– General-to-specific vs Specific-to-general
[Figure: position of the frequent itemset border in the lattice, between null and {a1, a2, ..., an}, under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional search]
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– Equivalence Classes (two itemsets belong to the same class if they share the same prefix or the same suffix)

[Figure: the lattice over {A, B, C, D} partitioned into equivalence classes by common prefix (left) and by common suffix (right)]
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– Breadth-first vs Depth-first
– Apriori traverses in BFS manner
– DFS can quickly find maximal frequent itemsets
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
ECLAT: for each item, store a list of transaction ids (tids); vertical data layout
Horizontal Data Layout:
TID | Items
1 | A, B, E
2 | B, C, D
3 | C, E
4 | A, C, D
5 | A, B, C, D
6 | A, E
7 | A, B
8 | A, B, C
9 | A, C, D
10 | B

Vertical Data Layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1)
subsets.
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B = tid-list of AB: 1, 5, 7, 8 ⇒ support(AB) = 4
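With Python sets standing in for the tid-lists, the intersection is a one-liner (a sketch, not library code):

```python
# Tid-lists from the vertical layout above.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}

ab = tidlists["A"] & tidlists["B"]        # tid-list of the 2-itemset {A, B}
print(sorted(ab), "support =", len(ab))   # [1, 5, 7, 8] support = 4
```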
Transaction Reduction
• Prune irrelevant transactions early
• If a transaction does not contain any frequent k-itemsets, it cannot help in
generating any frequent (k+1)-itemsets.
• This pruning is done after each iteration:
– After processing frequent k-itemsets, remove transactions that contain no
frequent k-itemsets
– Repeat the process for (k+1), (k+2), etc.
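A minimal sketch of this pruning step (the function name is our own):

```python
from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    """Keep only transactions that contain at least one frequent k-itemset;
    the rest cannot contribute to any frequent (k+1)-itemset."""
    return [t for t in transactions
            if any(frozenset(c) in frequent_k
                   for c in combinations(sorted(t), k))]
```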
FP GROWTH ALGORITHM
Apriori: uses a generate-and-test approach – generates candidate itemsets
and tests if they are frequent
Generation of candidate itemsets is expensive (in both space and time)
Support counting is expensive
Subset checking (computationally expensive)
Mining Frequent Patterns Without Candidate Generation
STEP 1: FP-TREE CONSTRUCTION
Pass 1:
Scan the data once to determine the support of each item; discard infrequent items and sort the frequent items in decreasing support order
Pass 2:
Nodes correspond to items and have a counter
FP-Growth reads one transaction at a time and maps it to a path
A fixed item order is used, so paths can overlap when transactions share items
(i.e. when they have the same prefix); in this case, the counters along the shared prefix are incremented
Pointers are maintained between nodes containing the same item, creating singly linked lists (the node-links)
The more paths overlap, the higher the compression; the FP-tree may then fit in memory
FP-Growth Method : An Example
Consider the example of a database D consisting of 9 transactions:

TID | List of Items
T1 | I1, I2, I5
T2 | I2, I4
T3 | I2, I3
T4 | I1, I2, I4
T5 | I1, I3
T6 | I2, I3
T7 | I1, I3
T8 | I1, I2, I3, I5
T9 | I1, I2, I3

• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%)
• The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts
• The set of frequent items is sorted in descending order of support count
• The resulting set is denoted L = {I2:7, I1:6, I3:6, I4:2, I5:2}
Support Count and Sorted Items
Items | Support
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2

TID | Items | Sorted Items
T1 | I1, I2, I5 | I2, I1, I5
T2 | I2, I4 | I2, I4
T3 | I2, I3 | I2, I3
T4 | I1, I2, I4 | I2, I1, I4
T5 | I1, I3 | I1, I3
T6 | I2, I3 | I2, I3
T7 | I1, I3 | I1, I3
T8 | I1, I2, I3, I5 | I2, I1, I3, I5
T9 | I1, I2, I3 | I2, I1, I3
FP-Growth Method: Construction of FP-Tree
• First, create the root of the tree, labeled with “null”.
• Scan the database D a second time (First time we scanned it to create 1-itemset
and then L), this will generate the complete tree.
• The items in each transaction are processed in L order (i.e. sorted order).
• A branch is created for each transaction; along a branch, each node is labeled with an item and its support count, separated by a colon.
• Whenever the same node is encountered in another transaction, we just
increment the support count of the common node or Prefix.
• To facilitate tree traversal, an item header table is built so that each item points to
its occurrences in the tree via a chain of node-links.
• The problem of mining frequent patterns in the database is thus transformed into mining the FP-Tree.
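A compact sketch of the second pass (class and function names are our own), assuming the transactions have already been sorted into L order as in the "Sorted Items" column above:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> FPNode

def build_fp_tree(sorted_transactions):
    root = FPNode(None, None)           # the "null" root
    header = {}                         # item -> list of its nodes (node-links)
    for t in sorted_transactions:
        node = root
        for item in t:                  # shared prefixes reuse existing nodes
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1             # increment the counter along the path
    return root, header

sorted_db = [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
             ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
             ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"]]
root, header = build_fp_tree(sorted_db)
print(root.children["I2"].count)        # 7, matching the I2:7 node
```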
FP-Growth Method: Construction of FP-Tree
Item Id | Sup Count
I2 | 7
I1 | 6
I3 | 6
I4 | 2
I5 | 2
(each header entry points, via its node-link chain, to the occurrences of that item in the tree)

null {}
├─ I2:7
│   ├─ I1:4
│   │   ├─ I5:1
│   │   ├─ I4:1
│   │   └─ I3:2
│   │       └─ I5:1
│   ├─ I4:1
│   └─ I3:2
└─ I1:2
    └─ I3:2
An FP-Tree that registers compressed, frequent pattern information
Mining the FP-Tree by Creating Conditional (sub) pattern
bases
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in
the FP-Tree that co-occur with the suffix pattern.
3. Then, construct its conditional FP-Tree & perform mining on this tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the
frequent patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required
frequent itemset.
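Building on the FPNode sketch above, step 2 can be sketched as follows: follow the node-links for the suffix item and climb each node's parent pointers to collect the prefix paths.

```python
def conditional_pattern_base(item, header):
    """Prefix paths that co-occur with `item`, each with the node's count."""
    base = []
    for node in header[item]:           # walk the node-link chain for `item`
        path, parent = [], node.parent
        while parent.item is not None:  # climb towards the null root
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base("I5", header))
# [(['I2', 'I1'], 1), (['I2', 'I1', 'I3'], 1)], matching the table below
```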
FP-Tree Example Continued
Item | Conditional pattern base | Conditional FP-Tree | Frequent patterns generated
I5 | {(I2 I1: 1), (I2 I1 I3: 1)} | ⟨I2:2, I1:2⟩ | I2 I5:2, I1 I5:2, I2 I1 I5:2
I3 | {(I2 I1: 2), (I2: 2), (I1: 2)} | ⟨I2:4, I1:2⟩, ⟨I1:2⟩ | I2 I3:4, I1 I3:4, I2 I1 I3:2

(For I3, the pattern I1 I3 has support 4: count 2 from the I2 I1 paths plus count 2 from the standalone I1 path.)
Why Is Frequent Pattern Growth Fast?
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scans
– Basic operation is counting and FP-tree building
FP-TREE SIZE
The FP-Tree usually has a smaller size than the uncompressed data, since many transactions typically share items (and hence prefixes).
• Best case scenario: all transactions contain the same set of items
– 1 path in the FP-tree
• Worst case scenario: every transaction has a unique set of items (no items in common)
– The size of the FP-tree is then at least as large as the original data
– Storage requirements for the FP-tree are higher, since it must also store the pointers between nodes and the counters
The size of the FP-tree also depends on how the items are ordered.
• Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic)
ADVANTAGES AND DISADVANTAGES
Advantages of FP-Growth
only 2 passes over data-set
“compresses” data-set
no candidate generation
much faster than Apriori
Disadvantages of FP-Growth
FP-Tree may not fit in memory!!
FP-Tree is expensive to build
RULE GENERATION
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f
satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
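A minimal sketch (our own naming) that enumerates all 2^k − 2 candidate rules of one frequent itemset and keeps those meeting minconf; note that only precomputed support counts are needed, no further database scans:

```python
from itertools import combinations

def rules_from_itemset(L, support, minconf):
    """`support` maps frozensets to support counts; returns (lhs, rhs, conf)."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                # every non-empty proper subset f
        for f in map(frozenset, combinations(sorted(L), r)):
            conf = support[L] / support[f]    # c(f -> L - f)
            if conf >= minconf:
                rules.append((set(f), set(L - f), conf))
    return rules
```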
RULE GENERATION
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
But the confidence of rules generated from the same itemset does have an anti-monotone property
e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule
Candidate rules are therefore generated by merging two rules that share the same prefix in the rule consequent:
join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
COMPUTING INTERESTINGNESS MEASURE
Given a rule X → Y, the information needed to compute rule interestingness can be
obtained from a contingency table.

Contingency table for X → Y:

      |  Y  | ¬Y  |
X     | f11 | f10 | f1+
¬X    | f01 | f00 | f0+
      | f+1 | f+0 | |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

The counts are used to define various measures:
support, confidence, lift, Gini index, J-measure, etc.
DRAWBACK OF CONFIDENCE
       | Coffee | ¬Coffee |
Tea    |   15   |    5    |  20
¬Tea   |   75   |    5    |  80
       |   90   |   10    | 100

Consider the rule Tea → Coffee:
• confidence = P(Coffee | Tea) = 15/20 = 0.75, which looks high
• but P(Coffee) = 0.9, and P(Coffee | ¬Tea) = 75/80 = 0.9375
• although confidence is high, the rule is misleading: tea drinkers are actually less likely to buy coffee than non-tea drinkers

Lift compensates for this by normalizing confidence by the support of the consequent:

Lift = confidence(X → Y) / support(Y)
EXAMPLE: LIFT/INTEREST
Using the same tea/coffee table:
• confidence(Tea → Coffee) = P(Coffee | Tea) = 15/20 = 0.75
• support(Coffee) = 90/100 = 0.9
• Lift = 0.75 / 0.9 ≈ 0.833
Since Lift < 1, Tea and Coffee are negatively associated, despite the high confidence.
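The same numbers worked in code (a trivial sketch; variable names follow the contingency-table notation):

```python
f11, f10, f01, f00 = 15, 5, 75, 5             # tea/coffee contingency counts
n = f11 + f10 + f01 + f00                     # |T| = 100

confidence = f11 / (f11 + f10)                # P(Coffee | Tea) = 0.75
support_y = (f11 + f01) / n                   # P(Coffee) = 0.9
lift = confidence / support_y

print(confidence, support_y, round(lift, 4))  # 0.75 0.9 0.8333
```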