Chap5 Basic Association Analysis
Chapter 5
Association Analysis: Basic Concepts
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence, not causality!
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf
thresholds
Computationally prohibitive!
3/8/2021 Introduction to Data Mining, 2nd Edition 5
Computational Complexity
Given d unique items:
– Total number of itemsets = $2^d$
– Total number of possible association rules:
$$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$$
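The rule-count formula above can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and d = 6 is chosen because the market-basket example has six distinct items.

```python
from math import comb


def rule_count_by_sum(d: int) -> int:
    """Count rules by choosing a k-item left-hand side, then a
    non-empty right-hand side from the remaining d - k items."""
    total = 0
    for k in range(1, d):
        rhs_choices = sum(comb(d - k, j) for j in range(1, d - k + 1))
        total += comb(d, k) * rhs_choices
    return total


def rule_count_closed_form(d: int) -> int:
    """Closed form of the same count: 3^d - 2^(d+1) + 1."""
    return 3**d - 2 ** (d + 1) + 1


# The two expressions agree for small d; for d = 6 both give 602.
for d in range(1, 11):
    assert rule_count_by_sum(d) == rule_count_closed_form(d)
print(rule_count_closed_form(6))
```

Even six items already yield hundreds of possible rules, which is why listing every rule is computationally prohibitive.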
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
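As a rough sketch of the rule-generation step, the code below enumerates every binary partition of one itemset over the five market-basket transactions and keeps the rules that reach a confidence threshold. The helper names and the minconf value of 0.6 are assumptions made for illustration.

```python
from itertools import combinations

# The five market-basket transactions from the example table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset, minconf):
    """Return (lhs, rhs, confidence) for each binary partition of the
    itemset whose confidence is at least minconf."""
    itemset = frozenset(itemset)
    whole = support_count(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            lhs_count = support_count(lhs)
            if lhs_count == 0:
                continue  # confidence is undefined for an unseen left-hand side
            conf = whole / lhs_count
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules


# minconf = 0.6 is an arbitrary threshold chosen for this sketch.
for lhs, rhs, conf in rules_from({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

Every rule shares the support of {Milk, Diaper, Beer}, so only the denominator of the confidence changes from rule to rule.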
[Figure: itemset lattice over the items A, B, C, D, E, listing the 1-itemsets (A ... E), the 2-itemsets (AB ... DE), and the 3-itemsets (ABC ... CDE)]
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
[Figure: the N transactions (maximum width w) are matched against a list of M candidates]
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$$
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
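The anti-monotone property can be checked exhaustively on the five market-basket transactions. This is a minimal sketch with hypothetical helper names. It compares each itemset only with its immediate subsets, which is sufficient because the inequality chains transitively.

```python
from itertools import combinations

# The five market-basket transactions from the example table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support_is_anti_monotone():
    """Check s(X) >= s(Y) for every itemset Y and every subset X
    obtained by dropping one item from Y."""
    for size in range(1, len(items) + 1):
        for y in combinations(items, size):
            for dropped in y:
                x = tuple(i for i in y if i != dropped)
                if support_count(x) < support_count(y):
                    return False
    return True


print(support_is_anti_monotone())
```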
[Figure: itemset lattice over the items A-E, from null down to ABCDE. Once an itemset is found to be infrequent, all of its supersets are pruned.]
Illustrating Apriori Principle
TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk
Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1
Minimum Support = 3
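The item counts in the table can be reproduced with a few lines of Python. The sketch below also takes one further level-wise step, counting only pairs built from frequent items. That second step is an illustrative extension of the table, and the variable names are assumptions.

```python
from collections import Counter
from itertools import combinations

# The five market-basket transactions from the example table.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diaper", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Bread", "Coke", "Diaper", "Milk"},
]
minsup = 3  # minimum support count, as stated in the example

# Level 1: count every item and keep those that reach minsup.
item_counts = Counter(item for t in transactions for item in t)
frequent_items = sorted(i for i, c in item_counts.items() if c >= minsup)

# Level 2: by the Apriori principle, only pairs of frequent items
# can themselves be frequent, so Coke and Eggs are never paired.
pair_counts = {
    pair: sum(1 for t in transactions if set(pair) <= t)
    for pair in combinations(frequent_items, 2)
}
frequent_pairs = sorted(p for p, c in pair_counts.items() if c >= minsup)

print(dict(item_counts))
print(frequent_items)
print(frequent_pairs)
```

Dropping Coke and Eggs leaves 6 candidate pairs to count instead of the 15 pairs that six items would otherwise produce.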
Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be
the set of frequent 3-itemsets
Candidate generation (merge two frequent 3-itemsets whose first two items are identical)
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
– ABCD survives because all of its 3-item subsets (ABC, ABD, ACD, BCD) are in F3
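The merge and prune steps above can be sketched as follows. The sketch assumes each item is a single uppercase letter and each itemset is a string with its items in alphabetical order. The function names are hypothetical.

```python
from itertools import combinations

# Frequent 3-itemsets from the example.
F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]


def generate_candidates(frequent):
    """Merge two frequent k-itemsets that share their first k - 1 items."""
    frequent = sorted(frequent)
    k = len(frequent[0])
    candidates = []
    for a, b in combinations(frequent, 2):
        if a[: k - 1] == b[: k - 1]:
            # a sorts before b, so appending b's last item keeps the order
            candidates.append(a + b[-1])
    return candidates


def prune(candidates, frequent):
    """Keep a candidate only if every subset one item smaller is frequent."""
    frequent = set(frequent)
    return [
        c
        for c in candidates
        if all("".join(sub) in frequent for sub in combinations(c, len(c) - 1))
    ]


candidates = generate_candidates(F3)
print(candidates)
print(prune(candidates, F3))
```

On this F3 the merge step yields ABCD, ABCE, and ABDE, and only ABCD survives pruning.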
Alternate merge on the same F3 (merge two frequent 3-itemsets when the last two items of the first are identical to the first two items of the second)
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk
Itemset
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}
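[Figure: level-by-level enumeration of the 3-item subsets of transaction t = {1, 2, 3, 5, 6}. Level 1 fixes the first item (1, 2, or 3), Level 2 fixes the second item, and the resulting subsets are 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.]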
[Figure: candidate hash tree. The hash function sends items 1, 4, 7 to one branch, items 2, 5, 8 to a second branch, and items 3, 6, 9 to a third. The 15 candidate 3-itemsets stored in the leaves are 124, 125, 136, 145, 159, 234, 345, 356, 357, 367, 368, 457, 458, 567, 689.]
[Figure: hashing the transaction 1 2 3 5 6 down the candidate hash tree. At the root the transaction splits into 1+ 2356, 2+ 356, and 3+ 56. Below 1+ it splits further into 12+ 356, 13+ 56, and 15+ 6.]
Match transaction against 11 out of 15 candidates
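For comparison, the same support-counting step can be sketched without a hash tree: enumerate the 3-item subsets of the transaction and look each one up in a Python set of candidates. This swaps the hash tree for a built-in hash set, so it illustrates the matching step rather than the tree traversal itself. The candidate list is the one read off the hash tree figure.

```python
from itertools import combinations

# The 15 candidate 3-itemsets stored in the hash tree of the example.
candidates = {
    (1, 2, 4), (1, 2, 5), (1, 3, 6), (1, 4, 5), (1, 5, 9),
    (2, 3, 4), (3, 4, 5), (3, 5, 6), (3, 5, 7), (3, 6, 7),
    (3, 6, 8), (4, 5, 7), (4, 5, 8), (5, 6, 7), (6, 8, 9),
}
transaction = (1, 2, 3, 5, 6)

# Every 3-item subset of the transaction, in sorted item order.
subsets = list(combinations(transaction, 3))

# Candidates contained in the transaction; each would have its
# support count incremented by one.
matched = sorted(s for s in subsets if s in candidates)

print(len(subsets))
print(matched)
```

The transaction has 10 subsets of size 3, and three of them (125, 136, 356) are candidates.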
Rule Generation
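[Figure: lattice of rules for the itemset ABCD, with ABCD=>{ } at the top and BCD=>A, ACD=>B, ABD=>C, ABC=>D on the next level. One rule in the lattice is marked as a low-confidence rule.]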
Size of database
Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1
Minimum Support = 3
If every subset is considered,
6C1 + 6C2 + 6C3
6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16
Minimum Support = 2
If every subset is considered,
6C1 + 6C2 + 6C3 + 6C4
6 + 15 + 20 + 15 = 56
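The "every subset is considered" totals are plain sums of binomial coefficients and can be reproduced with `math.comb`. The function name below is hypothetical. The pruned total of 16 is taken from the slide and is not recomputed here.

```python
from math import comb


def candidates_without_pruning(num_items: int, max_size: int) -> int:
    """Number of itemsets of size 1..max_size over num_items items."""
    return sum(comb(num_items, k) for k in range(1, max_size + 1))


# Six items, itemsets up to size 3 and up to size 4.
print(candidates_without_pruning(6, 3))  # 6 + 15 + 20
print(candidates_without_pruning(6, 4))  # 6 + 15 + 20 + 15
```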
$$\text{Number of frequent itemsets} = 3 \times \sum_{k=1}^{10} \binom{10}{k}$$
[Figure: itemset lattice over the items A-E, marking the maximal itemsets and the border with the infrequent itemsets]
[Figure: transaction-by-item grids (axes: Transactions 1-10, Items) for a data set with three item groups, (A1-A10), (B1-B10), (C1-C10)]
[Figure: itemset lattice over the items A-E, annotated with transaction IDs. ABCDE is not supported by any transactions.]
# Closed frequent = 9
# Maximal frequent = 4
[Figure: transaction-by-item grids (axes: Transactions 1-10, Items) over the item groups (A1-A10), (B1-B10), (C1-C10)]
Itemset   Support
{C,D}     2
{C,E}     2
{D,E}     2
{C,D,E}   2
        Coffee  ¬Coffee  Total
Tea      150      50      200
¬Tea     650     150      800
Total    800     200     1000
The criterion
confidence(X → Y) = support(Y)
is equivalent to:
– P(Y|X) = P(Y)
– P(X,Y) = P(X) P(Y) (X and Y are independent)
$$\text{Lift} = \frac{P(Y \mid X)}{P(Y)}$$
$$\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}$$
lift is used for rules while interest is used for itemsets
$$PS = P(X, Y) - P(X)\,P(Y)$$
$$\phi\text{-coefficient} = \frac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)[1 - P(X)]\,P(Y)[1 - P(Y)]}}$$
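These measures can be evaluated on the Tea/Coffee contingency table. The sketch below takes X = Tea and Y = Coffee and reads the counts as 150 transactions with both, 200 with Tea, and 800 with Coffee out of 1000. The variable names are chosen for illustration.

```python
from math import sqrt

# Counts read from the Tea/Coffee contingency table.
n = 1000
tea_and_coffee = 150
tea = 200
coffee = 800

p_xy = tea_and_coffee / n  # P(X, Y)
p_x = tea / n              # P(X)
p_y = coffee / n           # P(Y)

confidence = p_xy / p_x            # P(Y | X), about 0.75
lift = confidence / p_y            # about 0.9375
interest = p_xy / (p_x * p_y)      # same value as lift
ps = p_xy - p_x * p_y              # about -0.01
phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))  # about -0.0625

print(round(confidence, 4), round(lift, 4), round(interest, 4))
print(round(ps, 4), round(phi, 4))
```

The confidence of Tea → Coffee looks high, yet it is below P(Coffee) = 0.8, so lift falls under 1 and both PS and the φ-coefficient are slightly negative.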
Invariant measures:
cosine, Jaccard, All-confidence, confidence
Non-invariant measures:
correlation, Interest/Lift, odds ratio, etc
Mosteller:
Underlying association should be independent of
the relative number of male and female students
in the samples
[Figure: support distribution of a retail data set; many items have low support. Example items: caviar, milk.]
Observation:
conf(caviar → milk) is very high
but
conf(milk → caviar) is very low
Therefore,
min( conf(caviar → milk),
conf(milk → caviar) )
is also very low
$$\text{hconf}(X) = \min\left\{ \frac{s(X)}{s(x_1)}, \frac{s(X)}{s(x_2)}, \ldots, \frac{s(X)}{s(x_d)} \right\} = \frac{s(X)}{\max\{ s(x_1), s(x_2), \ldots, s(x_d) \}}$$
Cross Support and H-confidence
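Because $s(X) \le \min\{ s(x_1), s(x_2), \ldots, s(x_d) \}$ by the anti-monotone property of support:
$$\text{hconf}(X) = \frac{s(X)}{\max\{ s(x_1), s(x_2), \ldots, s(x_d) \}} \le \frac{\min\{ s(x_1), s(x_2), \ldots, s(x_d) \}}{\max\{ s(x_1), s(x_2), \ldots, s(x_d) \}} = r(X)$$
$$0 \le \text{hconf}(X) \le r(X) \le 1$$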
H-confidence is anti-monotone
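A small sketch of both quantities on the five market-basket transactions from earlier in the chapter follows. The function names are hypothetical, and r(X) is taken to be the ratio of the smallest to the largest single-item support, as in the bound above.

```python
from itertools import combinations

# The five market-basket transactions from the example table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def hconf(itemset):
    """s(X) divided by the largest single-item support in X."""
    return support_count(itemset) / max(support_count({x}) for x in itemset)


def cross_support_ratio(itemset):
    """Smallest single-item support in X divided by the largest."""
    singles = [support_count({x}) for x in itemset]
    return min(singles) / max(singles)


# hconf never exceeds the cross-support ratio on this data.
for pair in combinations(items, 2):
    assert hconf(pair) <= cross_support_ratio(pair) + 1e-12

print(hconf({"Bread", "Milk"}), cross_support_ratio({"Bread", "Milk"}))
print(hconf({"Bread", "Milk", "Diaper"}))
```

Adding Diaper to {Bread, Milk} lowers the h-confidence from 0.75 to 0.5, consistent with the anti-monotone property.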