
Data Mining

Chapter 5
Association Analysis: Basic Concepts

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

3/8/2021 Introduction to Data Mining, 2nd Edition 1


Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

3/8/2021 Introduction to Data Mining, 2nd Edition 2


Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items

• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2

• Support
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
3/8/2021 Introduction to Data Mining, 2nd Edition 3
Definition: Association Rule
• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
3/8/2021 Introduction to Data Mining, 2nd Edition 4
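The two metrics can be computed directly from the transaction list. A minimal Python sketch (the helper names and in-memory data layout are my own, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 2/5 = 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 2/3 = 0.67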
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!
3/8/2021 Introduction to Data Mining, 2nd Edition 5
Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

      R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k, j) ]
        = 3^d - 2^(d+1) + 1

    If d = 6, R = 602 rules

3/8/2021 Introduction to Data Mining, 2nd Edition 6
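As a quick sanity check of the formula, a short Python snippet (illustrative only) enumerates the rule count for small d:

from math import comb

def num_rules(d):
    # R = sum over k=1..d-1 of C(d,k) * sum over j=1..d-k of C(d-k, j)
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))        # 602
print(3**6 - 2**7 + 1)     # 602, the closed form 3^d - 2^(d+1) + 1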


Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
3/8/2021 Introduction to Data Mining, 2nd Edition 7
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive

3/8/2021 Introduction to Data Mining, 2nd Edition 8


Frequent Itemset Generation
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are 2^d possible candidate itemsets.
3/8/2021 Introduction to Data Mining, 2nd Edition 9
Frequent Itemset Generation

 Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
[Figure: the N transactions (the 5-transaction market-basket data above) are matched against the list of M candidate itemsets; w is the maximum transaction width.]

– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!
3/8/2021 Introduction to Data Mining, 2nd Edition 10
Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M

• Reduce the number of transactions (N)
  – Reduce the size of N as the size of the itemset increases
  – Used by DHP and vertical-based mining algorithms

• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction

3/8/2021 Introduction to Data Mining, 2nd Edition 11


Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

• The Apriori principle holds due to the following property of the support measure:

    ∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support

3/8/2021 Introduction to Data Mining, 2nd Edition 12


Illustrating Apriori Principle

[Figure: the itemset lattice over items A-E; once an itemset is found to be infrequent, all of its supersets are pruned from the search space.]
3/8/2021 Introduction to Data Mining, 2nd Edition 13
Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Minimum Support = 3

If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

3/8/2021 Introduction to Data Mining, 2nd Edition 14




Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Pairs (2-itemsets)
Beer 3 {Bread,Milk}
Diaper 4 {Bread, Beer } (No need to generate
Eggs 1 {Bread,Diaper}
{Beer, Milk}
candidates involving Coke
{Diaper, Milk} or Eggs)
{Beer,Diaper}

Minimum Support = 3

If every subset is considered,


C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

3/8/2021 Introduction to Data Mining, 2nd Edition 16


Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Beer, Bread} 2 (No need to generate
Eggs 1 {Bread,Diaper} 3 candidates involving Coke
{Beer,Milk} 2
{Diaper,Milk} 3 or Eggs)
{Beer,Diaper} 3
Minimum Support = 3

If every subset is considered,


C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

3/8/2021 Introduction to Data Mining, 2nd Edition 17


Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets), candidates:
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}

If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

3/8/2021 Introduction to Data Mining, 2nd Edition 18


Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets):
Itemset               | Count
{Beer, Diaper, Milk}  | 2
{Beer, Bread, Diaper} | 2
{Bread, Diaper, Milk} | 2
{Beer, Bread, Milk}   | 1

If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16

3/8/2021 Introduction to Data Mining, 2nd Edition 19


Illustrating Apriori Principle

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3
Diaper 4 {Bread,Beer} 2 (No need to generate
Eggs 1
{Bread,Diaper} 3 candidates involving Coke
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3
{Beer,Diaper} 3
Minimum Support = 3
Triplets (3-itemsets):
Itemset               | Count
{Beer, Diaper, Milk}  | 2
{Beer, Bread, Diaper} | 2
{Bread, Diaper, Milk} | 2
{Beer, Bread, Milk}   | 1

If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 4 = 16
6 + 6 + 1 = 13

3/8/2021 Introduction to Data Mining, 2nd Edition 20


Apriori Algorithm

– Fk: frequent k-itemsets
– Lk: candidate k-itemsets

• Algorithm
  – Let k = 1
  – Generate F1 = {frequent 1-itemsets}
  – Repeat until Fk is empty
    • Candidate Generation: Generate Lk+1 from Fk
    • Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
    • Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
    • Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent ⇒ Fk+1

3/8/2021 Introduction to Data Mining, 2nd Edition 21
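A compact Python sketch of this loop (an illustration under the assumption that itemsets are represented as sorted tuples; it is not the book's reference implementation):

from itertools import combinations

def apriori(transactions, minsup_count):
    items = sorted({item for t in transactions for item in t})
    # F1 = frequent 1-itemsets
    F = [{(i,) for i in items
          if sum(1 for t in transactions if i in t) >= minsup_count}]
    k = 1
    while F[-1]:
        # Candidate generation (Fk x Fk): merge itemsets sharing their first k-1 items
        prev = sorted(F[-1])
        Lk1 = {tuple(sorted(set(a) | set(b)))
               for a, b in combinations(prev, 2) if a[:k - 1] == b[:k - 1]}
        # Candidate pruning: every k-subset of a candidate must be frequent
        Lk1 = {c for c in Lk1 if all(s in F[-1] for s in combinations(c, k))}
        # Support counting and candidate elimination
        F.append({c for c in Lk1
                  if sum(1 for t in transactions if set(c) <= t) >= minsup_count})
        k += 1
    return [fk for fk in F if fk]

transactions = [{"Bread", "Milk"},
                {"Beer", "Bread", "Diaper", "Eggs"},
                {"Beer", "Coke", "Diaper", "Milk"},
                {"Beer", "Bread", "Diaper", "Milk"},
                {"Bread", "Coke", "Diaper", "Milk"}]
print(apriori(transactions, minsup_count=3))
# F1: 4 frequent items; F2: 4 frequent pairs; no frequent 3-itemset survives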


Candidate Generation: Brute-force method

TID Items
1 Bread, Milk
2 Beer, Bread, Diaper, Eggs
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Bread, Coke, Diaper, Milk

3/8/2021 Introduction to Data Mining, 2nd Edition 22
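For contrast, brute-force candidate generation simply enumerates every k-subset of the d items; a tiny illustrative snippet:

from itertools import combinations

items = ["Beer", "Bread", "Coke", "Diaper", "Eggs", "Milk"]

# Every 3-subset of the 6 items is a candidate, i.e. C(6, 3) = 20 candidates,
# regardless of which 2-itemsets turned out to be frequent.
candidates = list(combinations(items, 3))
print(len(candidates))   # 20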


Candidate Generation: Merge Fk-1 and F1 itemsets

3/8/2021 Introduction to Data Mining, 2nd Edition 23


Candidate Generation: Fk-1 x Fk-1 Method

• Merge two frequent (k-1)-itemsets if their first (k-2) items are identical

• F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
  – Merge(ABC, ABD) = ABCD
  – Merge(ABC, ABE) = ABCE
  – Merge(ABD, ABE) = ABDE

  – Do not merge(ABD, ACD) because they share only a prefix of length 1 instead of length 2

3/8/2021 Introduction to Data Mining, 2nd Edition 24
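A small sketch of this merge step (itemsets represented as sorted tuples; the function name is illustrative):

def merge_fk1_fk1(frequent):
    """F(k-1) x F(k-1): merge two (k-1)-itemsets whose first k-2 items are identical."""
    prev = sorted(frequent)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:              # identical prefix of length k-2
                candidates.add(a + (b[-1],))  # e.g. ABC merged with ABD gives ABCD
    return candidates

F3 = {("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")}
print(sorted(merge_fk1_fk1(F3)))
# [('A','B','C','D'), ('A','B','C','E'), ('A','B','D','E')]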


Candidate Pruning

• Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets

• L4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the previous slide)

• Candidate pruning
  – Prune ABCE because ACE and BCE are infrequent
  – Prune ABDE because ADE is infrequent

• After candidate pruning: L4 = {ABCD}


3/8/2021 Introduction to Data Mining, 2nd Edition 25
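The pruning step only needs subset lookups against F3; a short illustrative sketch:

from itertools import combinations

def prune(candidates, frequent_k_minus_1):
    """Keep a candidate only if all of its (k-1)-subsets are frequent."""
    return {c for c in candidates
            if all(s in frequent_k_minus_1 for s in combinations(c, len(c) - 1))}

F3 = {("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")}
L4 = {("A","B","C","D"), ("A","B","C","E"), ("A","B","D","E")}
print(prune(L4, F3))   # {('A','B','C','D')}: ABCE and ABDE are pruned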
Candidate Generation: Fk-1 x Fk-1 Method

3/8/2021 Introduction to Data Mining, 2nd Edition 26


Illustrating Apriori Principle

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Minimum Support = 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Diaper, Milk} | 2

If every subset is considered,
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning,
6 + 6 + 1 = 13

Use of the Fk-1 x Fk-1 method for candidate generation results in only one 3-itemset. This is eliminated after the support counting step.

3/8/2021 Introduction to Data Mining, 2nd Edition 27


Alternate Fk-1 x Fk-1 Method

• Merge two frequent (k-1)-itemsets if the last (k-2) items of the first one are identical to the first (k-2) items of the second

• F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
  – Merge(ABC, BCD) = ABCD
  – Merge(ABD, BDE) = ABDE
  – Merge(ACD, CDE) = ACDE
  – Merge(BCD, CDE) = BCDE

3/8/2021 Introduction to Data Mining, 2nd Edition 28


Candidate Pruning for Alternate Fk-1 x Fk-1 Method

• Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets

• L4 = {ABCD, ABDE, ACDE, BCDE} is the set of candidate 4-itemsets generated (from the previous slide)

• Candidate pruning
  – Prune ABDE because ADE is infrequent
  – Prune ACDE because ACE and ADE are infrequent
  – Prune BCDE because BCE is infrequent

• After candidate pruning: L4 = {ABCD}
3/8/2021 Introduction to Data Mining, 2nd Edition 29
Support Counting of Candidate Itemsets

• Scan the database of transactions to determine the support of each candidate itemset
  – Must match every candidate itemset against every transaction, which is an expensive operation

TID | Items                        Candidate itemsets:
1   | Bread, Milk                  {Beer, Diaper, Milk}
2   | Beer, Bread, Diaper, Eggs    {Beer, Bread, Diaper}
3   | Beer, Coke, Diaper, Milk     {Bread, Diaper, Milk}
4   | Beer, Bread, Diaper, Milk    {Beer, Bread, Milk}
5   | Bread, Coke, Diaper, Milk

3/8/2021 Introduction to Data Mining, 2nd Edition 30


Support Counting of Candidate Itemsets

• To reduce the number of comparisons, store the candidate itemsets in a hash structure
  – Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets

[Figure: the N transactions are hashed into the buckets of a hash structure that holds the k candidate itemsets.]

3/8/2021 Introduction to Data Mining, 2nd Edition 31


Support Counting: An Example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by transaction (1,2,3,5,6)?

Transaction t = {1, 2, 3, 5, 6}

[Figure: systematic, level-wise enumeration of the 3-item subsets of t: prefixes 1, 2, 3 at level 1; 12, 13, 15, 23, 25, 35 at level 2; and the ten 3-item subsets 123, 125, 126, 135, 136, 156, 235, 236, 256, 356 at level 3.]


3/8/2021 Introduction to Data Mining, 2nd Edition 32
Support Counting Using a Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)

Hash function: items 1, 4, 7 hash to the left branch; 2, 5, 8 to the middle; 3, 6, 9 to the right.

[Figure: the resulting candidate hash tree; the leaves hold the 15 candidate 3-itemsets, e.g. {1 4 5}, {1 2 4}, {4 5 7}, ..., {3 6 8}.]
3/8/2021 Introduction to Data Mining, 2nd Edition 33
Support Counting Using a Hash Tree

Hash Function: 1, 4, 7 hash left; 2, 5, 8 middle; 3, 6, 9 right.

[Figure: candidate hash tree; at the root, itemsets whose current item hashes on 1, 4 or 7 are routed to the left subtree.]

3/8/2021 Introduction to Data Mining, 2nd Edition 34


Support Counting Using a Hash Tree

Hash Function: 1, 4, 7 hash left; 2, 5, 8 middle; 3, 6, 9 right.

[Figure: the same candidate hash tree; itemsets whose current item hashes on 2, 5 or 8 are routed to the middle subtree.]

3/8/2021 Introduction to Data Mining, 2nd Edition 35


Support Counting Using a Hash Tree

Hash Function: 1, 4, 7 hash left; 2, 5, 8 middle; 3, 6, 9 right.

[Figure: the same candidate hash tree; itemsets whose current item hashes on 3, 6 or 9 are routed to the right subtree.]

3/8/2021 Introduction to Data Mining, 2nd Edition 36


Support Counting Using a Hash Tree

Hash Function: 1, 4, 7 hash left; 2, 5, 8 middle; 3, 6, 9 right.

Transaction: 1 2 3 5 6

[Figure: at the root of the hash tree, the transaction is split into the sub-problems 1 + 2356, 2 + 356, and 3 + 56, each routed down the corresponding branch.]

3/8/2021 Introduction to Data Mining, 2nd Edition 37


Support Counting Using a Hash Tree

Hash Function: 1, 4, 7 hash left; 2, 5, 8 middle; 3, 6, 9 right.

Transaction: 1 2 3 5 6

[Figure: one level down, the prefix 1 is expanded into 12 + 356, 13 + 56, and 15 + 6, directing the search into the relevant subtrees.]

3/8/2021 Introduction to Data Mining, 2nd Edition 38


Support Counting Using a Hash Tree

Hash Function: 1, 4, 7 hash left; 2, 5, 8 middle; 3, 6, 9 right.

Transaction: 1 2 3 5 6

[Figure: the complete traversal visits only the leaves that can contain 3-item subsets of the transaction.]
Match transaction against 11 out of 15 candidates
3/8/2021 Introduction to Data Mining, 2nd Edition 39
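For comparison, the naive check below (illustrative) tests all 15 candidates against the transaction; the hash tree reaches the same answer while matching against only 11 of them:

candidates = [{1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4},
              {5,6,7}, {3,4,5}, {3,5,6}, {3,5,7}, {6,8,9}, {3,6,7}, {3,6,8}]
t = {1, 2, 3, 5, 6}

# Brute force: one subset test per candidate (15 comparisons).
supported = [c for c in candidates if c <= t]
print(supported)   # [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}] are contained in t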
Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D,  ABD → C,  ACD → B,  BCD → A,
    A → BCD,  B → ACD,  C → ABD,  D → ABC,
    AB → CD,  AC → BD,  AD → BC,  BC → AD,
    BD → AC,  CD → AB

• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)

3/8/2021 Introduction to Data Mining, 2nd Edition 40
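Enumerating the candidate rules amounts to splitting L into every non-empty proper subset and its complement; a short illustrative sketch:

from itertools import combinations

def candidate_rules(L):
    """All rules X -> (L - X) for non-empty proper subsets X of the frequent itemset L."""
    items = set(L)
    return [(set(lhs), items - set(lhs))
            for r in range(1, len(items))
            for lhs in combinations(sorted(items), r)]

rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules))   # 2**4 - 2 = 14 candidate rules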


Rule Generation

• In general, confidence does not have an anti-monotone property
  – c(ABC → D) can be larger or smaller than c(AB → D)

• But the confidence of rules generated from the same itemset has an anti-monotone property
  – E.g., suppose {A,B,C,D} is a frequent 4-itemset:

    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

  – Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
3/8/2021 Introduction to Data Mining, 2nd Edition 41
Rule Generation for Apriori Algorithm

Lattice of rules
ABCD=>{ }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules

3/8/2021 Introduction to Data Mining, 2nd Edition 42


Association Analysis: Basic Concepts
and Algorithms

Algorithms and Complexity

3/8/2021 Introduction to Data Mining, 2nd Edition 43


Factors Affecting Complexity of Apriori

 Choice of minimum support threshold

 Dimensionality (number of items) of the data set

 Size of database

 Average transaction width


3/8/2021 Introduction to Data Mining, 2nd Edition 44


Factors Affecting Complexity of Apriori

• Choice of minimum support threshold
  – lowering the support threshold results in more frequent itemsets
  – this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
• Size of database
• Average transaction width

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

3/8/2021 Introduction to Data Mining, 2nd Edition 45


Impact of Support Based Pruning

TID | Items
1   | Bread, Milk
2   | Beer, Bread, Diaper, Eggs
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Minimum Support = 3:
  If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
  With support-based pruning, 6 + 6 + 4 = 16

Minimum Support = 2:
  If every subset is considered, C(6,1) + C(6,2) + C(6,3) + C(6,4) = 6 + 15 + 20 + 15 = 56

3/8/2021 Introduction to Data Mining, 2nd Edition 46


Factors Affecting Complexity of Apriori

 Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
 Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
 Size of database
TID Items
 Average transaction width 1 Bread, Milk
– 2 Beer, Bread, Diaper, Eggs
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Bread, Coke, Diaper, Milk

3/8/2021 Introduction to Data Mining, 2nd Edition 47


Factors Affecting Complexity of Apriori

 Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
 Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
 Size of database
– run time of algorithm increases with number of transactions
 Average transaction width
TID Items
1 Bread, Milk
2 Beer, Bread, Diaper, Eggs
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Bread, Coke, Diaper, Milk
3/8/2021 Introduction to Data Mining, 2nd Edition 48
Factors Affecting Complexity of Apriori

 Choice of minimum support threshold


– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of
frequent itemsets
 Dimensionality (number of items) of the data set
– More space is needed to store support count of itemsets
– if number of frequent itemsets also increases, both computation
and I/O costs may also increase
 Size of database
– run time of algorithm increases with number of transactions
 Average transaction width
– transaction width increases the max length of frequent itemsets
– number of subsets in a transaction increases with its width,
increasing computation time for support counting

3/8/2021 Introduction to Data Mining, 2nd Edition 49


Factors Affecting Complexity of Apriori

3/8/2021 Introduction to Data Mining, 2nd Edition 50


Compact Representation of Frequent Itemsets

• Some frequent itemsets are redundant because their supersets are also frequent

Consider the following data set. Assume the support threshold = 5.
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

10 
 3   
10
Number of frequent itemsets
k
k 1

 Need a compact representation


3/8/2021 Introduction to Data Mining, 2nd Edition 51
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.

[Figure: the itemset lattice over A-E with a border separating the frequent itemsets from the infrequent ones; the maximal frequent itemsets are the frequent itemsets adjacent to the border.]

3/8/2021 Introduction to Data Mining, 2nd Edition 52


What are the Maximal Frequent Itemsets in this Data?

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10


1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

Minimum support threshold = 5

Maximal frequent itemsets: {A1, ..., A10}, {B1, ..., B10}, and {C1, ..., C10}

3/8/2021 Introduction to Data Mining, 2nd Edition 53


An illustrative example

[Figure: a binary data set of 10 transactions over items A through J.]

Support threshold (by count): 5
  Frequent itemsets: ?
  Maximal itemsets: ?

3/8/2021 Introduction to Data Mining, 2nd Edition 54


An illustrative example

[Figure: the same 10-transaction data set over items A through J.]

Support threshold (by count): 5
  Frequent itemsets: {F}
  Maximal itemsets: {F}

Support threshold (by count): 4
  Frequent itemsets: ?
  Maximal itemsets: ?

3/8/2021 Introduction to Data Mining, 2nd Edition 55


An illustrative example

[Figure: the same 10-transaction data set over items A through J.]

Support threshold (by count): 5
  Frequent itemsets: {F}
  Maximal itemsets: {F}

Support threshold (by count): 4
  Frequent itemsets: {E}, {F}, {E,F}, {J}
  Maximal itemsets: {E,F}, {J}

Support threshold (by count): 3
  Frequent itemsets: ?
  Maximal itemsets: ?

3/8/2021 Introduction to Data Mining, 2nd Edition 56


An illustrative example

[Figure: the same 10-transaction data set over items A through J.]

Support threshold (by count): 5
  Frequent itemsets: {F}
  Maximal itemsets: {F}

Support threshold (by count): 4
  Frequent itemsets: {E}, {F}, {E,F}, {J}
  Maximal itemsets: {E,F}, {J}

Support threshold (by count): 3
  Frequent itemsets: all subsets of {C,D,E,F} + {J}
  Maximal itemsets: {C,D,E,F}, {J}

3/8/2021 Introduction to Data Mining, 2nd Edition 57


Another illustrative example

[Figure: another 10-transaction data set over items A through J.]

Support threshold (by count): 5
  Maximal itemsets: {A}, {B}, {C}

Support threshold (by count): 4
  Maximal itemsets: {A,B}, {A,C}, {B,C}

Support threshold (by count): 3
  Maximal itemsets: {A,B,C}

3/8/2021 Introduction to Data Mining, 2nd Edition 58


Closed Itemset

• An itemset X is closed if none of its immediate supersets has the same support as X.
• X is not closed if at least one of its immediate supersets has the same support count as X.

3/8/2021 Introduction to Data Mining, 2nd Edition 59


Closed Itemset

• An itemset X is closed if none of its immediate supersets has the same support as X.
• X is not closed if at least one of its immediate supersets has the same support count as X.

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset   | Support
{A}       | 4
{B}       | 5
{C}       | 3
{D}       | 4
{A,B}     | 4
{A,C}     | 2
{A,D}     | 3
{B,C}     | 3
{B,D}     | 4
{C,D}     | 3
{A,B,C}   | 2
{A,B,D}   | 3
{A,C,D}   | 2
{B,C,D}   | 2
{A,B,C,D} | 2
3/8/2021 Introduction to Data Mining, 2nd Edition 60
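A direct way to test the definition against the table above (helper names are illustrative):

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"},
                {"A","B","C","D"}]
all_items = {i for t in transactions for i in t}

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """Closed: no immediate superset has the same support count."""
    s = support_count(itemset)
    return all(support_count(itemset | {x}) < s for x in all_items - itemset)

print(is_closed({"B"}))            # True: support 5, no superset reaches 5
print(is_closed({"A"}))            # False: {A,B} also has support 4
print(is_closed({"A", "B", "D"}))  # True: adding C drops the support to 2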
Maximal vs Closed Itemsets
TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: the itemset lattice annotated with the transaction ids that support each itemset (e.g. A appears in transactions 1, 2, 4; ABCDE is not supported by any transaction).]

3/8/2021 Introduction to Data Mining, 2nd Edition 61


Maximal Frequent vs Closed Frequent Itemsets

Minimum support = 2

TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: the same annotated lattice, with frequent itemsets marked either as closed but not maximal, or as both closed and maximal.]

# Closed frequent itemsets = 9
# Maximal frequent itemsets = 4

3/8/2021 Introduction to Data Mining, 2nd Edition 62


What are the Closed Itemsets in this Data?

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10


1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

Closed itemsets: {A1, ..., A10}, {B1, ..., B10}, and {C1, ..., C10} (each has support 5, and no proper superset has the same support)

3/8/2021 Introduction to Data Mining, 2nd Edition 63


Example 1

[Figure: a 10-transaction binary data set over items A through J.]

Itemset | Support (count) | Closed itemset?
{C}     | 3               |
{D}     | 2               |
{C,D}   | 2               |

3/8/2021 Introduction to Data Mining, 2nd Edition 64


Example 1

[Figure: the same 10-transaction data set.]

Itemset | Support (count) | Closed itemset?
{C}     | 3               | yes
{D}     | 2               | no
{C,D}   | 2               | yes

3/8/2021 Introduction to Data Mining, 2nd Edition 65


Example 2

[Figure: a 10-transaction binary data set over items A through J.]

Itemset  | Support (count) | Closed itemset?
{C}      | 3               |
{D}      | 2               |
{E}      | 2               |
{C,D}    | 2               |
{C,E}    | 2               |
{D,E}    | 2               |
{C,D,E}  | 2               |

3/8/2021 Introduction to Data Mining, 2nd Edition 66


Example 2

[Figure: the same 10-transaction data set.]

Itemset  | Support (count) | Closed itemset?
{C}      | 3               | yes
{D}      | 2               | no
{E}      | 2               | no
{C,D}    | 2               | no
{C,E}    | 2               | no
{D,E}    | 2               | no
{C,D,E}  | 2               | yes

3/8/2021 Introduction to Data Mining, 2nd Edition 67


Example 3

[Figure: a 10-transaction binary data set over items A through J.]

Closed itemsets: {C,D,E,F}, {C,F}

3/8/2021 Introduction to Data Mining, 2nd Edition 68


Example 4

[Figure: a 10-transaction binary data set over items A through J.]

Closed itemsets: {C,D,E,F}, {C}, {F}

3/8/2021 Introduction to Data Mining, 2nd Edition 69


Maximal vs Closed Itemsets

3/8/2021 Introduction to Data Mining, 2nd Edition 70


Example question
• Given the following transaction data sets (dark cells indicate the presence of an item in a transaction) and a support threshold of 20%, answer the following questions.

[Figure: three binary transaction data sets, A, B, and C.]

a. What is the number of frequent itemsets for each data set? Which data set will produce the largest number of frequent itemsets?
b. Which data set will produce the longest frequent itemset?
c. Which data set will produce frequent itemsets with the highest maximum support?
d. Which data set will produce frequent itemsets containing items with widely varying support levels (i.e., itemsets containing items with mixed support, ranging from 20% to more than 70%)?
e. What is the number of maximal frequent itemsets for each data set? Which data set will produce the largest number of maximal frequent itemsets?
f. What is the number of closed frequent itemsets for each data set? Which data set will produce the largest number of closed frequent itemsets?
3/8/2021 Introduction to Data Mining, 2nd Edition 71
Pattern Evaluation

• Association rule algorithms can produce a large number of rules

• Interestingness measures can be used to prune/rank the patterns
  – In the original formulation, support & confidence are the only measures used

3/8/2021 Introduction to Data Mining, 2nd Edition 72


Computing Interestingness Measure

 Given X  Y or {X,Y}, information needed to compute


interestingness can be obtained from a contingency table
Contingency table
Y Y f11: support of X and Y
X f11 f10 f1+ f10: support of X and Y
X f01 f00 fo+ f01: support of X and Y
f+1 f+0 N f00: support of X and Y

Used to define various measures


 support, confidence, Gini,
entropy, etc.

3/8/2021 Introduction to Data Mining, 2nd Edition 73


Drawback of Confidence
Customer | Tea | Coffee | ...
C1       | 0   | 1      | ...
C2       | 1   | 0      | ...
C3       | 1   | 1      | ...
C4       | 1   | 0      | ...

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 150/200 = 0.75

Confidence > 50%, meaning people who drink tea are more likely to drink coffee than not to drink coffee.
So the rule seems reasonable.
3/8/2021 Introduction to Data Mining, 2nd Edition 74
Drawback of Confidence

          Coffee   ¬Coffee
  Tea       150        50      200
 ¬Tea       650       150      800
            800       200     1000

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 150/200 = 0.75

but P(Coffee) = 0.8, which means knowing that a person drinks tea reduces the probability that the person drinks coffee!
• Note that P(Coffee|¬Tea) = 650/800 = 0.8125

3/8/2021 Introduction to Data Mining, 2nd Edition 75


Drawback of Confidence
Customer | Tea | Honey | ...
C1       | 0   | 1     | ...
C2       | 1   | 0     | ...
C3       | 1   | 1     | ...
C4       | 1   | 0     | ...

Association Rule: Tea → Honey

Confidence = P(Honey|Tea) = 100/200 = 0.50
Confidence = 50%, which may mean that drinking tea has little influence on whether honey is used or not.
So the rule seems uninteresting.
But P(Honey) = 120/1000 = 0.12, hence tea drinkers are far more likely to have honey than the overall population.

3/8/2021 Introduction to Data Mining, 2nd Edition 76
Measure for Association Rules

• So, what kind of rules do we really want?
  – Confidence(X → Y) should be sufficiently high
    • To ensure that people who buy X will more likely buy Y than not buy Y

  – Confidence(X → Y) > support(Y)
    • Otherwise, the rule will be misleading, because having item X actually reduces the chance of having item Y in the same transaction

• Is there any measure that captures this constraint?
  – Answer: Yes. There are many of them.

3/8/2021 Introduction to Data Mining, 2nd Edition 77


Statistical Relationship between X and Y

• The criterion
    confidence(X → Y) = support(Y)
  is equivalent to:
  – P(Y|X) = P(Y)
  – P(X,Y) = P(X) × P(Y)   (X and Y are independent)

If P(X,Y) > P(X) × P(Y) : X and Y are positively correlated

If P(X,Y) < P(X) × P(Y) : X and Y are negatively correlated

3/8/2021 Introduction to Data Mining, 2nd Edition 78


Measures that take into account statistical dependence

Lift = P(Y|X) / P(Y)                  (lift is used for rules, while interest is used for itemsets)

Interest = P(X,Y) / ( P(X) P(Y) )

PS = P(X,Y) - P(X) P(Y)

φ-coefficient = ( P(X,Y) - P(X) P(Y) ) / sqrt( P(X)[1 - P(X)] P(Y)[1 - P(Y)] )

3/8/2021 Introduction to Data Mining, 2nd Edition 79


Example: Lift/Interest

          Coffee   ¬Coffee
  Tea       150        50      200
 ¬Tea       650       150      800
            800       200     1000

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.8
• Interest = 0.15 / (0.2 × 0.8) = 0.9375 (< 1, therefore Tea and Coffee are negatively associated)

So, is it enough to use confidence/interest for pruning?
3/8/2021 Introduction to Data Mining, 2nd Edition 80
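The same number can be reproduced from the contingency table; a small illustrative helper:

def interest(f11, f10, f01, f00):
    """Interest (lift) = P(X,Y) / (P(X) P(Y)) from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    p_xy = f11 / n
    p_x = (f11 + f10) / n
    p_y = (f11 + f01) / n
    return p_xy / (p_x * p_y)

# Tea/Coffee table from the slide: f11=150, f10=50, f01=650, f00=150
print(interest(150, 50, 650, 150))   # 0.9375, i.e. a slightly negative association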
There are lots of
measures proposed
in the literature

3/8/2021 Introduction to Data Mining, 2nd Edition 81


Comparing Different Measures

10 examples of contingency tables, and the rankings of these tables produced by various measures:

Example f11 f10 f01 f00


E1 8123 83 424 1370
E2 8330 2 622 1046
E3 9481 94 127 298
E4 3954 3080 5 2961
E5 2886 1363 1320 4431
E6 1500 2000 500 6000
E7 4000 2000 1000 3000
E8 4000 2000 2000 2000
E9 1720 7121 5 1154
E10 61 2483 4 7452

3/8/2021 Introduction to Data Mining, 2nd Edition 82


Property under Inversion Operation

[Figure: pairs of binary item vectors over transactions 1..N; the inversion operation flips every 0 to 1 and every 1 to 0.]

3/8/2021 Introduction to Data Mining, 2nd Edition 83


Property under Inversion Operation

[Figure: the same item vectors and their inversions.]

Correlation: -0.1667   -0.1667   (unchanged under inversion)
IS/cosine:    0.0       0.825    (changes under inversion)

3/8/2021 Introduction to Data Mining, 2nd Edition 84


Property under Null Addition

Invariant measures:
 cosine, Jaccard, All-confidence, confidence
Non-invariant measures:
 correlation, Interest/Lift, odds ratio, etc

3/8/2021 Introduction to Data Mining, 2nd Edition 85


Property under Row/Column Scaling

Grade-Gender Example (Mosteller, 1968):

          Male   Female                     Male   Female
  High     30      20      50       High     60      60     120
  Low      40      10      50       Low      80      30     110
           70      30     100               140      90     230

(In the second table, the Male column is scaled by 2x and the Female column by 3x.)

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.

The Odds-Ratio, (f11 f00) / (f10 f01), has this property.


3/8/2021 Introduction to Data Mining, 2nd Edition 86
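The scaling invariance of the odds ratio is easy to verify numerically; a short illustrative check using the grade-gender tables above:

def odds_ratio(f11, f10, f01, f00):
    """Odds ratio of a 2x2 contingency table: (f11 * f00) / (f10 * f01)."""
    return (f11 * f00) / (f10 * f01)

# Original table and the column-scaled table (Male x2, Female x3):
print(odds_ratio(30, 20, 40, 10))   # 0.375
print(odds_ratio(60, 60, 80, 30))   # 0.375: unchanged by the scaling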
Property under Row/Column Scaling

Relationship between mask use and susceptibility to Covid:

            Covid-Positive   Covid-Free                Covid-Positive   Covid-Free
  Mask            20             30        50   Mask         40            300       340
  No-Mask         40             10        50   No-Mask      80            100       180
                  60             40       100               120            400       520

(In the second table, the Covid-Positive column is scaled by 2x and the Covid-Free column by 10x.)

Mosteller: the underlying association should be independent of the relative number of Covid-positive and Covid-free subjects.

The Odds-Ratio, (f11 f00) / (f10 f01), has this property.


3/8/2021 Introduction to Data Mining, 2nd Edition 87
Different Measures have Different Properties

3/8/2021 Introduction to Data Mining, 2nd Edition 88


Simpson’s Paradox

 Observed relationship in data may be influenced


by the presence of other confounding factors
(hidden variables)
– Hidden variables may cause the observed relationship
to disappear or reverse its direction!

 Proper stratification is needed to avoid generating


spurious patterns

3/8/2021 Introduction to Data Mining, 2nd Edition 89


Simpson’s Paradox

 Recovery rate from Covid


– Hospital A: 80%
– Hospital B: 90%
 Which hospital is better?

3/8/2021 Introduction to Data Mining, 2nd Edition 90


Simpson’s Paradox

 Recovery rate from Covid


– Hospital A: 80%
– Hospital B: 90%
 Which hospital is better?
 Covid recovery rate on older population
– Hospital A: 50%
– Hospital B: 30%
 Covid recovery rate on younger population
– Hospital A: 99%
– Hospital B: 98%

3/8/2021 Introduction to Data Mining, 2nd Edition 91
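The reversal is a mixture effect: Hospital A treats a much larger share of the harder (older) cases. A small worked example in Python, with hypothetical patient counts chosen to reproduce the rates quoted above (the slides give only the rates):

# (number of patients, recovery rate) per age group; the counts are hypothetical.
hospital_a = {"older": (388, 0.50), "younger": (612, 0.99)}
hospital_b = {"older": (118, 0.30), "younger": (882, 0.98)}

def overall_rate(groups):
    total = sum(n for n, _ in groups.values())
    return sum(n * r for n, r in groups.values()) / total

print(round(overall_rate(hospital_a), 2))   # ~0.80: A looks worse overall
print(round(overall_rate(hospital_b), 2))   # ~0.90: B looks better overall
# ...even though A has the higher recovery rate within both age groups.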


Simpson’s Paradox

• Covid-19 deaths (per 100,000 population)
  – County A: 15
  – County B: 10
• Which county is managing the pandemic better?

3/8/2021 Introduction to Data Mining, 2nd Edition 92


Simpson’s Paradox

• Covid-19 deaths (per 100,000 population)
  – County A: 15
  – County B: 10
• Which county is managing the pandemic better?
• Covid death rate on the older population
  – County A: 20
  – County B: 40
• Covid death rate on the younger population
  – County A: 2
  – County B: 5

3/8/2021 Introduction to Data Mining, 2nd Edition 93


Effect of Support Distribution on Association Mining

• Many real data sets have a skewed support distribution

[Figure: support distribution of a retail data set, plotted against item rank (log scale): a few items have very high support, while many items have low support.]

3/8/2021 Introduction to Data Mining, 2nd Edition 94


Effect of Support Distribution

 Difficult to set the appropriate minsup threshold


– If minsup is too high, we could miss itemsets involving
interesting rare items (e.g., {caviar, vodka})

– If minsup is too low, it is computationally expensive


and the number of itemsets is very large

3/8/2021 Introduction to Data Mining, 2nd Edition 95


Cross-Support Patterns

A cross-support pattern involves items with widely varying degrees of support.
• Example: {caviar, milk}

How to avoid such patterns?

[Figure: support of caviar (very low) vs. milk (very high).]

3/8/2021 Introduction to Data Mining, 2nd Edition 96


A Measure of Cross Support

• Given an itemset X = {x1, x2, ..., xd} with d items, we can define a measure of cross support, r, for the itemset:

    r(X) = min{ s(x1), s(x2), ..., s(xd) } / max{ s(x1), s(x2), ..., s(xd) }

  where s(xi) is the support of item xi

  – r(X) can be used to prune cross-support patterns

3/8/2021 Introduction to Data Mining, 2nd Edition 97
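A one-line illustration of r(X), using hypothetical item supports for the caviar/milk example:

def cross_support_ratio(itemset, item_support):
    """r(X) = min item support / max item support over the items in X."""
    supports = [item_support[x] for x in itemset]
    return min(supports) / max(supports)

# Hypothetical supports: caviar is rare, milk is common.
item_support = {"caviar": 0.02, "milk": 0.70}
print(cross_support_ratio({"caviar", "milk"}, item_support))   # ~0.029, a cross-support pattern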


Confidence and Cross-Support Patterns

Observation:
  conf(caviar → milk) is very high, but
  conf(milk → caviar) is very low.

Therefore,
  min( conf(caviar → milk), conf(milk → caviar) )
is also very low.

[Figure: support of caviar (very low) vs. milk (very high).]

3/8/2021 Introduction to Data Mining, 2nd Edition 98


H-Confidence

• To avoid patterns whose items have very different support, define a new evaluation measure for itemsets
  – Known as h-confidence or all-confidence

• Specifically, given an itemset X = {x1, x2, ..., xd}
  – h-confidence is the minimum confidence of any association rule formed from itemset X

  – hconf(X) = min( conf(X1 → X2) ),
    where X1, X2 ⊂ X, X1 ∩ X2 = ∅, X1 ∪ X2 = X
    For example: X1 = {x1, x2}, X2 = {x3, ..., xd}

3/8/2021 Introduction to Data Mining, 2nd Edition 99


H-Confidence …

• But, given an itemset X = {x1, x2, ..., xd}
  – What is the lowest-confidence rule you can obtain from X?
  – Recall conf(X1 → X2) = s(X1 ∪ X2) / s(X1)
    • The numerator is fixed: s(X1 ∪ X2) = s(X)
    • Thus, to find the lowest-confidence rule, we need to find the X1 with the highest support
    • Consider only rules where X1 is a single item, i.e.,
      {x1} → X – {x1}, {x2} → X – {x2}, ..., or {xd} → X – {xd}

    hconf(X) = min( s(X)/s(x1), s(X)/s(x2), ..., s(X)/s(xd) )
             = s(X) / max( s(x1), s(x2), ..., s(xd) )
3/8/2021 Introduction to Data Mining, 2nd Edition 100
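A direct implementation of the final expression (illustrative, reusing the market-basket transactions from earlier slides):

def h_confidence(itemset, transactions):
    """hconf(X) = s(X) / max_i s(x_i): the lowest confidence of any rule formed from X."""
    def support(s):
        return sum(1 for t in transactions if s <= t) / len(transactions)
    return support(set(itemset)) / max(support({x}) for x in itemset)

transactions = [{"Bread", "Milk"},
                {"Beer", "Bread", "Diaper", "Eggs"},
                {"Beer", "Coke", "Diaper", "Milk"},
                {"Beer", "Bread", "Diaper", "Milk"},
                {"Bread", "Coke", "Diaper", "Milk"}]
print(h_confidence({"Beer", "Diaper"}, transactions))   # 0.6 / 0.8 = 0.75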
Cross Support and H-confidence

• By the anti-monotone property of support,

    s(X) ≤ min( s(x1), s(x2), ..., s(xd) )

• Therefore, we can derive a relationship between the h-confidence and the cross-support ratio of an itemset:

    hconf(X) = s(X) / max( s(x1), s(x2), ..., s(xd) )
             ≤ min( s(x1), s(x2), ..., s(xd) ) / max( s(x1), s(x2), ..., s(xd) )
             = r(X)

  Thus, hconf(X) ≤ r(X)


3/8/2021 Introduction to Data Mining, 2nd Edition 101
Cross Support and H-confidence …

 Since, hconf 𝑋 ≤ 𝑟 𝑋 , we can eliminate cross


support patterns by finding patterns with
h-confidence < hc, a user set threshold
 Notice that

0 ≤ hconf 𝑋 ≤ 𝑟(𝑋) ≤ 1

 Any itemset satisfying a given h-confidence


threshold, hc, is called a hyperclique
 H-confidence can be used instead of or in
conjunction with support
3/8/2021 Introduction to Data Mining, 2nd Edition 102
Properties of Hypercliques

 Hypercliques are itemsets, but not necessarily


frequent itemsets
– Good for finding low support patterns

 H-confidence is anti-monotone

 Can define closed and maximal hypercliques in


terms of h-confidence
– A hyperclique X is closed if none of its immediate supersets has the same h-confidence as X
– A hyperclique X is maximal if hconf(X) ≥ hc and none of its immediate supersets, Y, have hconf(Y) ≥ hc

3/8/2021 Introduction to Data Mining, 2nd Edition 103


Properties of Hypercliques …

 Hypercliques have the high-affinity property


– Think of the individual items as sparse binary vectors
– h-confidence gives us information about their pairwise
Jaccard and cosine similarity
 Assume 𝑥1 and 𝑥2 are any two items in an itemset X
 Jaccard(𝑥1 , 𝑥2 ) ≥ hconf(X)/2
 cos(𝑥1 , 𝑥2 ) ≥ hconf(X)
– Hypercliques that have a high h-confidence consist of
very similar items as measured by Jaccard and cosine
 The items in a hyperclique cannot have widely
different support
– Allows for more efficient pruning

3/8/2021 Introduction to Data Mining, 2nd Edition 104


Example Applications of Hypercliques

 Hypercliques are used to


find strongly coherent
groups of items
– Words that occur together
in documents
– Proteins in a protein
interaction network

In the figure at the right, a gene


ontology hierarchy for biological
process shows that the identified
proteins in the hyperclique (PRE2, …,
SCL1) perform the same function and
are involved in the same biological
process
3/8/2021 Introduction to Data Mining, 2nd Edition 105
