
Association Rule Mining

Data Mining 1
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns

Data Mining 2
Frequent Pattern
• Frequent Pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set.
• For example, a set of items, such as milk and bread, that appear frequently together in
a transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern.
• A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.

Data Mining 3
Frequent Pattern
Market Basket Analysis
• Frequent Pattern: a pattern that occurs frequently in a data set.
– A set of items that appear frequently together in a transaction data set is called a
frequent itemset.
• An example of frequent itemset mining is market basket analysis.
– This process analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.
– If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item.
– Each basket can then be represented by a Boolean vector of values assigned to these
variables.
– The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together.
– These patterns can be represented in the form of association rules.

Data Mining 4
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
  – k-itemset X = {x1, …, xk}
• (absolute) support of X: frequency (count) of an itemset X.
  – Absolute support of {Beer} is 3
• (relative) support of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X).
  – Relative support of {Beer} is 3/5
• An itemset X is frequent if X’s support is no less than a minsup threshold.

[Figure: Venn diagram of customers buying beer, buying diaper, and buying both]

Data Mining 5
Basic Concepts: Association Rules
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets

Association Rule Mining:
• Find all the rules X → Y with minimum support and minimum confidence
  – support: probability that a transaction contains X ∪ Y, i.e., P(X ∪ Y)
    • Fraction of transactions that contain both X and Y
  – confidence: conditional probability that a transaction having X also contains Y:
    P(Y|X) = support(X ∪ Y) / support(X)
    • Measures how often items in Y appear in transactions that contain X
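A minimal Python sketch of these two measures, computed on the Beer/Diaper transactions of the previous slide (the code and function names are illustrative additions, not part of the original slides):

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    # relative support: fraction of transactions containing the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # confidence(X -> Y) = support(X u Y) / support(X)
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer"}))                  # 0.6  (absolute support 3 out of 5)
print(support({"Beer", "Diaper"}))        # 0.6
print(confidence({"Beer"}, {"Diaper"}))   # 1.0
print(confidence({"Diaper"}, {"Beer"}))   # 0.75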

Data Mining 6
Basic Concepts: Association Rules
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets

Association Rule Mining:
• Find all the rules X → Y with minimum support and minimum confidence

Let minsup = 50%, minconf = 50%

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Frequent Patterns:
{Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Beer, Diaper}:3

Association Rules:
– {Beer} → {Diaper} (support 60%, confidence 100%)
– {Diaper} → {Beer} (support 60%, confidence 75%)

Data Mining 7
Why Use Support and Confidence?
• Support is an important measure because a rule that has very low support may occur
simply by chance.
– A low support rule may be uninteresting from a business perspective because it may not be
profitable to promote items that customers seldom buy together
– For these reasons, support is often used to eliminate uninteresting rules

• Confidence measures the reliability of the inference made by a rule.


– For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present
in transactions that contain X.

• Association analysis results should be interpreted with caution.


– The inference made by an association rule does not necessarily imply causality.
– Instead, it suggests a strong co-occurrence relationship between items in the antecedent and
consequent of the rule.

Data Mining 8
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules
having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally not feasible!

Data Mining 9
Mining Association Rules

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}    (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}    (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different
confidence
• Thus, we may decouple the support and confidence requirements

Data Mining 10
Association Rule Mining
• The problem of mining association rules can be reduced to that of mining frequent
itemsets.

• In general, association rule mining can be viewed as a two-step process:


1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, minsup.
– Generate all itemsets whose support ≥ minsup
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.
– Generate high confidence rules from each frequent itemset, where each rule is a
binary partitioning of a frequent itemset

– Frequent itemset generation is still computationally expensive

Data Mining 11
Association Rules - Example

Transactions:        minsup = 0.5, minconf = 0.7
A,B,D
A,B,C,D
A
A,B,C
B,C
B

• Find frequent itemsets and association rules satisfying minsup and minconf.

Data Mining 12
Association Rules - Example

Transactions:        minsup = 0.5, minconf = 0.7
A,B,D
A,B,C,D
A
A,B,C
B,C
B

• Find frequent itemsets and association rules satisfying minsup and minconf.

Frequent Itemsets:
1-itemsets: {A}    support({A}) = 4/6
            {B}    support({B}) = 5/6
            {C}    support({C}) = 3/6
2-itemsets: {A,B}  support({A,B}) = 3/6
            {B,C}  support({B,C}) = 3/6
Association Rules:
A→B conf(A→B) = 3/4
C→B conf(C→B) = 3/3

Data Mining 13
Frequent Itemset Generation

[Figure: an itemset lattice]

Given d items, there are 2^d possible candidate itemsets.

Data Mining 14
Frequent Itemset Generation

• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

[Figure: N transactions (of width w) matched against a list of M candidate itemsets]

– Match each transaction against every candidate
– Complexity ~ O(N M w) ⇒ expensive, since M = 2^d !!!
Data Mining 15
Computational Complexity
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules
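As a sanity check on this count, the short Python sketch below (an illustrative addition) enumerates every rule X → Y over d = 6 items, i.e., every ordered pair of disjoint non-empty itemsets, and confirms both the closed form and R = 602:

from itertools import combinations

def rule_count(d):
    # count rules X -> Y with X and Y non-empty and disjoint, over d items
    items = range(d)
    count = 0
    for k in range(1, d):                  # size of the antecedent X
        for X in combinations(items, k):
            remaining = d - k
            count += 2 ** remaining - 1    # any non-empty subset of the rest is a valid Y
    return count

d = 6
print(rule_count(d))              # 602
print(3**d - 2**(d + 1) + 1)      # 602, the closed form above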

Data Mining 16
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
– The Apriori principle is an effective way to eliminate some of the candidate
itemsets without counting their support values.

• Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

• Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases

Data Mining 17
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns

Data Mining 18
Reducing Number of Candidates
Apriori Principle
• Apriori Principle: If an itemset is frequent, then all of its subsets
must also be frequent.

• Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

Data Mining 19
Illustrating Apriori Principle
[Figure: itemset lattice over items A–E. If an itemset such as AB is found to be infrequent, all of its supersets (ABC, ABD, ABE, ..., ABCDE) are pruned.]

Data Mining 20
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Minimum Support = 3        Generate 1-itemset candidates

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16

Data Mining 21
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Minimum Support = 3        Eliminate infrequent 1-itemset candidates

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16

Data Mining 22
Illustrating Apriori Principle

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets)
{Bread, Milk}
{Bread, Beer}
{Bread, Diaper}
{Beer, Milk}
{Diaper, Milk}
{Beer, Diaper}
(No need to generate candidates involving Coke or Eggs)

Minimum Support = 3        Generate 2-itemset candidates

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16

Data Mining 23
Illustrating Apriori Principle

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets)
Itemset            Count
{Bread, Milk}      3
{Beer, Bread}      2
{Bread, Diaper}    3
{Beer, Milk}       2
{Diaper, Milk}     3
{Beer, Diaper}     3
(No need to generate candidates involving Coke or Eggs)

Minimum Support = 3        Eliminate infrequent 2-itemset candidates

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16

Data Mining 24
Illustrating Apriori Principle

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets)
Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)
Itemset
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}

Minimum Support = 3        Generate 3-itemset candidates

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16


Data Mining 25
Illustrating Apriori Principle

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Pairs (2-itemsets)
Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3
(No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)
Itemset                  Count
{Beer, Diaper, Milk}     2
{Beer, Bread, Diaper}    2
{Bread, Diaper, Milk}    2
{Beer, Bread, Milk}      1

Minimum Support = 3
Prune 3-itemset candidates with infrequent 2-itemsets
Eliminate infrequent 3-itemset candidates

If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16
6 + 6 + 1 = 13 (with candidate pruning using infrequent 2-itemsets)
Data Mining 26
Apriori Algorithm:
Finding Frequent Itemsets Using Candidate Generation
• Apriori pruning principle: If an itemset is infrequent, then its supersets should not be
generated/tested!

Apriori Algorithm (notation: Fk = frequent k-itemsets, Lk = candidate k-itemsets)


• Let k=1
• Generate F1 = {frequent 1-itemsets}
• Repeat until Fk is empty
– Candidate Generation: Generate Lk+1 from Fk
– Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that
are infrequent
– Support Counting: Count the support of each candidate in Lk+1 by scanning the DB
– Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only
those that are frequent => Fk+1
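A compact, self-contained Python sketch of this loop is given below. It is an illustration of the steps above (using the slides' convention that Fk denotes frequent and Lk candidate k-itemsets), not an optimized implementation; in particular it counts support with a plain scan rather than the hash tree discussed later.

from itertools import combinations

def apriori(transactions, minsup_count):
    # returns {frozenset(itemset): support_count} for all frequent itemsets
    transactions = [frozenset(t) for t in transactions]

    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {s: c for s, c in counts.items() if c >= minsup_count}
    frequent = dict(Fk)

    k = 1
    while Fk:
        # Candidate Generation: merge frequent k-itemsets sharing their first k-1 items
        prev = sorted(sorted(s) for s in Fk)
        Lk1 = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k - 1] == prev[j][:k - 1]:
                    Lk1.add(frozenset(prev[i]) | frozenset(prev[j]))
        # Candidate Pruning: drop candidates containing an infrequent k-subset
        Lk1 = {c for c in Lk1
               if all(frozenset(s) in Fk for s in combinations(c, k))}
        # Support Counting: one scan over the database
        cand_counts = {c: 0 for c in Lk1}
        for t in transactions:
            for c in Lk1:
                if c <= t:
                    cand_counts[c] += 1
        # Candidate Elimination: keep only the frequent candidates => F(k+1)
        Fk = {c: n for c, n in cand_counts.items() if n >= minsup_count}
        frequent.update(Fk)
        k += 1
    return frequent

# the 5-transaction market-basket example used on the earlier slides
db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, count in sorted(apriori(db, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)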

Data Mining 27
Apriori Algorithm:
Candidate Generation: Fk-1 x Fk-1 Method
• Merge two frequent (k-1)-itemsets if their first (k-2) items are identical

• F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE

– Do not merge (ABD, ACD) because they share only a prefix of length 1 instead of
length 2

• L4 = {ABCD,ABCE,ABDE} is the set of candidate 4-itemsets generated

Data Mining 28
Apriori Algorithm:
Candidate Pruning
• Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of frequent 3-itemsets

• L4 = {ABCD,ABCE,ABDE} is the set of candidate 4-itemsets generated

• Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent

• After candidate pruning: L4 = {ABCD}
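The following Python sketch (an illustrative addition) reproduces the candidate generation and pruning steps of these two slides; each itemset is written as a sorted string of single-character items:

from itertools import combinations

F3 = ["ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE"]   # frequent 3-itemsets
k = 3

# Fk x Fk candidate generation: merge two frequent k-itemsets whose first k-1 items match
candidates = set()
for i in range(len(F3)):
    for j in range(i + 1, len(F3)):
        if F3[i][:k - 1] == F3[j][:k - 1]:
            candidates.add("".join(sorted(set(F3[i]) | set(F3[j]))))
print(sorted(candidates))    # ['ABCD', 'ABCE', 'ABDE']

# candidate pruning: keep a (k+1)-candidate only if every k-subset of it is frequent
frequent_k = set(F3)
pruned = [c for c in sorted(candidates)
          if all("".join(sub) in frequent_k for sub in combinations(c, k))]
print(pruned)    # ['ABCD'] -- ABCE dropped (ACE, BCE infrequent), ABDE dropped (ADE infrequent)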

Data Mining 29
Apriori Algorithm:
Support Counting of Candidate Itemsets
• Scan the database of transactions to determine the support of each candidate itemset
– Must match every candidate itemset against every transaction, which is an
expensive operation

• To reduce the number of comparisons, store the candidates in a hash structure


– Instead of matching each transaction against every candidate, match it against
candidates contained in the hashed buckets

[Figure: N transactions matched against candidate itemsets of length k stored in a hash structure with buckets]

Data Mining 30
Apriori Algorithm

Data Mining 31
Apriori Algorithm

Data Mining 32
Apriori Algorithm - An Example

Data Mining 33
Support Counting Using Hash Tree
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very large
– One transaction may contain many candidates
– Must match every candidate itemset against every transaction, which is an
expensive operation

• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
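The full hash tree is somewhat involved; the sketch below (an illustrative simplification, not the hash tree of the following figures) conveys the same idea by bucketing each candidate under its smallest item, so a transaction is only compared against candidates whose bucket key it actually contains:

from collections import defaultdict

# the 15 candidate 3-itemsets used on the next slides
candidates = [{1,4,5},{1,2,4},{4,5,7},{1,2,5},{4,5,8},{1,5,9},{1,3,6},{2,3,4},
              {5,6,7},{3,4,5},{3,5,6},{3,5,7},{6,8,9},{3,6,7},{3,6,8}]

# bucket each candidate under its smallest item (a simplified stand-in for the hash tree)
buckets = defaultdict(list)
for c in candidates:
    buckets[min(c)].append(frozenset(c))

def count_supports(transactions):
    counts = defaultdict(int)
    for t in transactions:
        t = set(t)
        for item in t:                        # visit only buckets keyed by items of t
            for c in buckets.get(item, []):
                if c <= t:
                    counts[c] += 1
    return counts

# for transaction {1,2,3,5,6}, only {1,2,5}, {1,3,6} and {3,5,6} are contained in it
print(count_supports([[1, 2, 3, 5, 6]]))

A real hash tree goes further and hashes recursively on later items as well, which is what limits the comparison to 9 of the 15 candidates in the example that follows.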

Data Mining 34
Support Counting Using Hash Tree
Subset Operation
• Enumerating subsets of three items from a transaction t

Data Mining 35
Support Counting Using Hash Tree
Generate Candidate Hash Tree
• Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7},
{3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

• We need a hash function
  – e.g., h(item) = item mod 3, giving branches {1,4,7}, {2,5,8}, {3,6,9}

[Figure: candidate hash tree built from the 15 candidate 3-itemsets using the mod-3 hash function]

Data Mining 36
Support Counting Using Hash Tree
Generate Candidate Hash Tree

Hash function: {1,4,7} | {2,5,8} | {3,6,9}

[Figure: the candidate hash tree; interior nodes hash on an item value, leaf nodes hold the candidate 3-itemsets]

Data Mining 37
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts

Transaction t = {1 2 3 5 6}
First-level enumeration: 1+ {2 3 5 6}, 2+ {3 5 6}, 3+ {5 6}

[Figure: hashing on the first item (1, 2, or 3) selects the corresponding branches of the candidate hash tree]

Data Mining 38
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts

Transaction t = {1 2 3 5 6}
Second-level enumeration: 12+ {3 5 6}, 13+ {5 6}, 15+ {6}

[Figure: the traversal continues one level deeper in the candidate hash tree with the remaining items of the transaction]

Data Mining 39
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts

Transaction t = {1 2 3 5 6}

[Figure: the traversal reaches the leaves of the candidate hash tree; only candidates stored in the visited leaves are compared with the transaction]
Match transaction against 9 out of 15 candidates
Data Mining 40
Factors Affecting Complexity of Apriori Algorithm
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may also
increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with number of
transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets; the number of subsets of a
transaction also increases with its width

Data Mining 41
Effect of Support Threshold
• Effect of support threshold on the number of candidate and frequent itemsets

[Plots: number of candidate itemsets and number of frequent itemsets as the support threshold varies]

Data Mining 42
Effect of Average Transaction Width
• Effect of average transaction width on the number of candidate and frequent itemsets

[Plots: number of candidate itemsets and number of frequent itemsets as the average transaction width varies]

Data Mining 43
Effect of Support Distribution
• How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets involving interesting rare items
(e.g., expensive products)

– If minsup is set too low, it is computationally expensive and the number of


itemsets is very large

• Using a single minimum support threshold may not be effective

Data Mining 44
Multiple Minimum Support
• How to apply multiple minimum supports?
– MS(i): minimum support for item i
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%

– Challenge: Support is no longer anti-monotone


• Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%

• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent
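A tiny numeric check of this effect, using the values on this slide (the helper function is an illustrative addition):

MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def minsup(itemset):
    # minimum support threshold of an itemset = smallest MS value of its items
    return min(MS[item] for item in itemset)

support = {frozenset({"Milk", "Coke"}): 0.015,
           frozenset({"Milk", "Coke", "Broccoli"}): 0.005}

for itemset, s in support.items():
    status = "frequent" if s >= minsup(itemset) else "infrequent"
    print(sorted(itemset), status)
# ['Coke', 'Milk'] infrequent            (1.5% < min(5%, 3%) = 3%)
# ['Broccoli', 'Coke', 'Milk'] frequent  (0.5% >= min(5%, 3%, 0.1%) = 0.1%)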

Data Mining 45
Multiple Minimum Support
• Order the items according to their minimum support (in ascending order)
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– Ordering: Broccoli, Salmon, Coke, Milk

• Need to modify Apriori such that:


– L1 : set of frequent items
– F1 : set of items whose support is ≥ MS(1), where MS(1) = min_i( MS(i) )
– C2 : candidate itemsets of size 2 are generated from F1 instead of L1

Data Mining 46
Multiple Minimum Support
• Modifications to Apriori:
– In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two
frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subsets of size k

– Pruning step has to be modified:


• Prune only if subset contains the first item
• e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to
minimum support)
• {Broccoli, Coke} and {Broccoli, Milk} are frequent but
{Coke, Milk} is infrequent
– Candidate is not pruned because {Coke,Milk} does not contain
the first item, i.e., Broccoli.

Data Mining 47
Rule Generation in Apriori Algorithm
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the candidate
rule f → L – f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules:
  ABC → D,  ABD → C,  ACD → B,  BCD → A,
  D → ABC,  C → ABD,  B → ACD,  A → BCD,
  AB → CD,  AC → BD,  AD → BC,
  CD → AB,  BD → AC,  BC → AD

• If |L| = k, then there are 2^k – 2 candidate association rules
  – (ignoring L → ∅ and ∅ → L)

Data Mining 48
Rule Generation in Apriori Algorithm
• How to efficiently generate rules from frequent itemsets?

• In general, confidence does not have an anti-monotone property
  – c(ABC → D) can be larger or smaller than c(AB → D)

• But confidence of rules generated from the same itemset has an anti-monotone property
  – E.g., suppose {A,B,C,D} is a frequent 4-itemset:

    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
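A Python sketch (an illustrative addition) of rule generation from a single frequent itemset: consequents are grown level by level, and a consequent is only extended if its rule met minconf, which is the anti-monotone pruning described above. It is run on the {Milk, Diaper, Beer} itemset from the earlier 5-transaction example.

db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]

def support(itemset):
    return sum(set(itemset) <= t for t in db) / len(db)

def rules_from(itemset, minconf):
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # start with 1-item consequents
    while consequents:
        next_level = set()
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = support(itemset) / support(X)
            if conf >= minconf:
                rules.append((X, Y, conf))
                # only consequents of confident rules are extended (anti-monotone property)
                for extra in itemset - Y:
                    next_level.add(Y | {extra})
        consequents = list(next_level)
    return rules

for X, Y, conf in rules_from({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(sorted(X), "->", sorted(Y), round(conf, 2))
# prints the four rules with confidence >= 0.6 from the earlier slide (s = 0.4 each)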

Data Mining 49
Rule Generation in Apriori Algorithm
Lattice of rules
[Figure: lattice of rules derived from the frequent itemset {A,B,C,D}, from ABCD ⇒ { } at the top down to single-antecedent rules. If a rule such as BCD ⇒ A has low confidence, then all rules whose consequent contains A (CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, D ⇒ ABC, C ⇒ ABD, B ⇒ ACD) are pruned.]
Data Mining 50
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns

Data Mining 51
Compact Representation of Frequent Itemsets
• The number of frequent itemsets produced from a transaction data set can be very
large.
• Some of the produced itemsets can be redundant because they have the same support as their
supersets

• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived ⇒ need a compact representation:
– Maximal Frequent Itemsets and
– Closed Frequent Itemsets

Data Mining 52
Maximal Frequent Itemsets
Maximal Frequent Itemset: A maximal frequent itemset is defined as a frequent
itemset for which none of its immediate supersets are frequent.

• Maximal frequent itemsets effectively provide a compact representation of frequent


itemsets.
• Maximal frequent itemsets form the smallest set of itemsets from which all frequent
itemsets can be derived.

Data Mining 53
Maximal Frequent Itemsets
[Figure: itemset lattice] All frequent itemsets can be derived from the maximal frequent
itemsets {a,d}, {a,c,e}, and {b,c,d,e}.
Any frequent itemset is a subset of some maximal frequent itemset.

Data Mining 54
Maximal Frequent Itemsets
• Despite providing a compact representation, maximal frequent itemsets do not contain
the support information of their subsets.
• For example, the support of the maximal frequent itemsets {a, c, e}, {a, d}, and
{b,c,d,e} do not provide any hint about the support of their subsets.
• An additional pass over the data set is therefore needed to determine the support
counts of the non-maximal frequent itemsets.

• It might be desirable to have a minimal representation of frequent itemsets that
preserves the support information ⇒ closed frequent itemsets

Data Mining 55
Closed Frequent Itemsets
Closed Itemset: An itemset X is closed if none of its immediate supersets has exactly
the same support count as X.

• Closed itemsets provide a minimal representation of itemsets without losing their


support information.
• Put another way, X is not closed if at least one of its immediate supersets has the same
support count as X.

Closed Frequent Itemset: An itemset is a closed frequent itemset if it is closed and


its support is greater than or equal to minsup.
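A small Python sketch (an illustrative addition, on a toy 4-transaction data set of our own) that classifies frequent itemsets as closed and/or maximal directly from these definitions:

from itertools import combinations

def frequent_itemsets(db, minsup_count):
    # brute-force enumeration with support counts (fine for tiny data sets)
    items = sorted(set().union(*db))
    freq = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            count = sum(set(cand) <= t for t in db)
            if count >= minsup_count:
                freq[frozenset(cand)] = count
    return freq

db = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"b", "c"}, {"a", "c"}]
freq = frequent_itemsets(db, minsup_count=2)

for X, cnt in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    supersets = [Y for Y in freq if len(Y) == len(X) + 1 and X < Y]
    closed = all(freq[Y] < cnt for Y in supersets)   # no immediate superset with equal support
    maximal = not supersets                          # no immediate superset is frequent at all
    flags = [name for name, flag in (("closed", closed), ("maximal", maximal)) if flag]
    print(sorted(X), cnt, ", ".join(flags))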

Data Mining 56
Closed Frequent Itemsets
All subsets of a closed frequent itemset are frequent, and their supports are greater than
or equal to the support of that closed frequent itemset.

For example, all subsets of a closed frequent itemset abc are frequent
and their supports ≥ support of abc.

Data Mining 57
Maximal vs Closed Itemsets

[Figure: itemset lattice marking which frequent itemsets are closed but not maximal, and which are both closed and maximal]

# Closed = 9
# Maximal = 4

Data Mining 58
Maximal vs Closed Itemsets

[Figure: nested sets] Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

Data Mining 59
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns

Data Mining 60
FP-Growth (Frequent Pattern Growth) Algorithm
• FP-growth is an algorithm that takes a radically different approach to discovering
frequent itemsets.
– The algorithm does not subscribe to the generate-and-test paradigm of Apriori

• FP-growth algorithm encodes the data set using a compact data structure called an
FP-tree and extracts frequent itemsets directly from this structure.
– Use a compressed representation of the database using an FP-tree
– Once an FP-tree has been constructed, it uses a recursive divide-and-conquer
approach to mine the frequent itemsets

Data Mining 61
FP-Tree Construction
• An FP-tree is a compressed representation of the input data.

• It is constructed by reading the data set one transaction at a time and mapping each
transaction onto a path in the FP-tree.
– Different transactions can have several items in common, so their paths may overlap.
– The more the paths overlap with one another, the more compression we can
achieve using the FP-tree structure.

Data Mining 62
FP-Tree Construction
• Each node in the tree contains the label of an item along with a counter that shows the
number of transactions mapped onto the given path.
– Initially, the FP-tree contains only the root node represented by the null symbol.
– Every transaction maps onto one of the paths in the FP-tree.

• The size of an FP-tree is typically smaller than the size of the uncompressed data
because many transactions in market basket data often share a few items in common.
– Best-case scenario: all transactions have the same set of items
  ⇒ the FP-tree contains only a single branch.
– Worst-case scenario: every transaction has a unique set of items
  ⇒ the FP-tree is effectively the same size as the original data.
– The physical storage requirement for the FP-tree is higher, because it requires additional
space to store pointers between nodes and counters for each item.
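A compact Python sketch of the FP-tree structure described above (an illustrative addition): each node stores an item label, a count, a parent link, its children, and a node-link used for the header-table chains. Items within a transaction are inserted in the fixed order A..E used on the following slides; real implementations usually sort them by decreasing support instead.

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item          # item label (None for the root)
        self.count = 0            # number of transactions mapped through this node
        self.parent = parent
        self.children = {}        # item -> child FPNode
        self.link = None          # next node with the same item (header-table chain)

def build_fptree(transactions, minsup_count):
    # 1st pass: count items and keep only the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i for i, c in counts.items() if c >= minsup_count}

    root, header = FPNode(), {}
    # 2nd pass: map each transaction onto a path of the tree
    for t in transactions:
        node = root
        for item in sorted(i for i in t if i in frequent):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                # thread the new node onto the header-table chain for its item
                child.link, header[item] = header.get(item), child
            child.count += 1
            node = child
    return root, header

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
      {"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"}]
root, header = build_fptree(db, minsup_count=2)
print({item: child.count for item, child in root.children.items()})   # {'A': 7, 'B': 3}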

Data Mining 63
FP-Tree Construction
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

[Figure: FP-tree after reading TID=1 (null – A:1 – B:1), after reading TID=2 (a second branch null – B:1 – C:1 – D:1 is added), and after reading TID=3 (the A branch grows to A:2 with a new path C:1 – D:1 – E:1)]
Data Mining 64
FP-Tree Construction
Transaction database (TID / Items):
1 {A,B}, 2 {B,C,D}, 3 {A,C,D,E}, 4 {A,D,E}, 5 {A,B,C}, 6 {A,B,C,D}, 7 {B,C}, 8 {A,B,C}, 9 {A,B,D}, 10 {B,C,E}

[Figure: FP-tree after reading all transactions. The root (null) has children A:7 and B:3, with further nodes such as B:5, C:3, C:1, D:1, and E:1 along the branches. A header table listing the items A, B, C, D, E holds pointers that link all nodes with the same item label; these pointers are used to assist frequent itemset generation.]
Data Mining 65
Frequent Itemset Generation
in FP-Growth Algorithm
• FP-growth is an algorithm that generates frequent itemsets from an FP-tree by
exploring the tree in a bottom-up fashion.
– This bottom-up strategy for finding frequent itemsets ending with a particular item is equivalent to the
suffix-based approach
– Since every transaction is mapped onto a path in the FP-tree, we can derive the frequent itemsets
ending with a particular item, say e, by examining only the paths containing node e.
– The algorithm looks for frequent itemsets ending in e first, followed by d, c, b, and finally, a.
• FP-growth finds all the frequent itemsets ending with a particular suffix by employing
a divide-and-conquer strategy to split the problem into smaller subproblems.
– To find all frequent itemsets ending in e, we must first check whether the itemset {e} itself is frequent.
– If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce,
be, and ae.
– In turn, each of these subproblems is further decomposed into smaller subproblems.
– By merging the solutions obtained from the subproblems, all the frequent itemsets ending in e can be
found.
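The recursion can be illustrated without building the tree itself: for a chosen suffix item, gather the prefix of every transaction that contains it and recurse on that conditional data set. The sketch below (an illustrative simplification that works on the raw transactions rather than on FP-tree paths, so it shows the recursion but not the compression) reproduces the suffix-e itemsets derived on the following slides.

def mine_suffix(transactions, minsup_count, suffix=frozenset()):
    # recursively collect frequent itemsets that extend `suffix`
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {}
    for item in sorted(counts, reverse=True):      # process suffixes e, d, c, b, a
        if counts[item] < minsup_count:
            continue
        itemset = frozenset([item]) | suffix
        frequent[itemset] = counts[item]
        # conditional data set: prefixes (items before `item` in the fixed order)
        # of the transactions that contain `item`
        conditional = [{i for i in t if i < item} for t in transactions if item in t]
        frequent.update(mine_suffix(conditional, minsup_count, itemset))
    return frequent

db = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"}, {"A","B","C"},
      {"A","B","C","D"}, {"B","C"}, {"A","B","C"}, {"A","B","D"}, {"B","C","E"}]
result = mine_suffix(db, minsup_count=2)
print(sorted(sorted(s) for s in result if "E" in s))
# [['A', 'D', 'E'], ['A', 'E'], ['C', 'E'], ['D', 'E'], ['E']]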

Data Mining 66
Finding Frequent Itemsets Ending with e
1. The first step is to gather all the paths containing node e. These initial paths are called prefix
paths
2. From the prefix paths, the support count for e is obtained by adding the support counts
associated with node e. Assuming that the minimum support count is 2, {e} is declared a
frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent
itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert
the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except
it is used to find frequent itemsets ending with a particular suffix.
– First, the support counts along the prefix paths must be updated because some of the counts include
transactions that do not contain item e.
– The prefix paths are truncated by removing the nodes for e.
– After updating the support counts along the prefix paths, some of the items may no longer be frequent
• the node b appears only once and has a support count equal to 1, which means that there is only one transaction
that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in
be must be infrequent.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent
itemsets ending in de, ce, and ae.

Data Mining 67
Prefix Paths Ending with e
[Transaction database and FP-tree with header table as on the previous slides]

[Figure: the prefix paths ending with e, i.e., the paths of the FP-tree that contain a node labeled e: null–A–C–D–E:1, null–A–D–E:1, and null–B–C–E:1]
Data Mining 68
Conditional FP-Tree for e
[Transaction database as before]        minsup = 2

To create the conditional FP-tree for e:
• Update the support counts, because paths without e are removed
• e is frequent (support = 3); remove the e nodes from the prefix paths
• Remove infrequent nodes

[Figure: the prefix paths ending with e and the resulting conditional FP-tree for e]

Data Mining 69
Conditional FP-Tree for de
[Transaction database as before]        minsup = 2

[Figure: from the conditional FP-tree for e, the prefix paths ending with de are gathered and converted into the conditional FP-tree for de (a single path null – A:2)]

de is frequent (support = 2)

Data Mining 70
Conditional FP-Tree for ce
[Transaction database as before]        minsup = 2

[Figure: from the conditional FP-tree for e, the prefix paths ending with ce are gathered and converted into the conditional FP-tree for ce (null – A:1)]

ce is frequent (support = 2)

Data Mining 71
Conditional FP-Tree for ae
[Transaction database as before]        minsup = 2

[Figure: from the conditional FP-tree for e, the prefix paths ending with ae are gathered; the conditional FP-tree for ae contains only the null root]

ae is frequent (support = 2)

Data Mining 72
Frequent Itemsets Ordered by Suffixes
[Transaction database as before]        minsup = 2

Suffix   Frequent Itemsets
E        {E}, {D,E}, {A,D,E}, {C,E}, {A,E}
D        {D}, {C,D}, {B,C,D}, {A,C,D}, {B,D}, {A,B,D}, {A,D}
C        {C}, {B,C}, {A,B,C}, {A,C}
B        {B}, {A,B}
A        {A}

[Figure: the FP-tree with its header table, as before]
Data Mining 73
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns

Data Mining 74
Evaluation of Association Patterns
• Association rule algorithms tend to produce too many rules
– many of them are uninteresting or redundant
– {A,B} → {D} is redundant if {A,B,C} → {D} and {A,B} → {D}
have the same support & confidence
– An association rule X → Y is redundant if there exists another rule X′ → Y′,
where X is a subset of X′ and Y is a subset of Y′, such that the support and
confidence for both rules are identical.

• Interestingness measure can be used to prune/rank the derived patterns

• In the original formulation of association rules, support & confidence are the only
measures used

Data Mining 75
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be
obtained from a contingency table

[Table: contingency table for X → Y, with counts of transactions containing/not containing X and Y]

Data Mining 76
Drawback of Confidence

          Coffee   ¬Coffee
Tea         15        5       20
¬Tea        75        5       80
            90       10      100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75 = support({Tea,Coffee}) / support({Tea})

but P(Coffee) = 0.9

⇒ Although confidence is high, the rule is misleading
⇒ P(Coffee|¬Tea) = 0.9375

Data Mining 77
Measure for Association Rules
• So, what kind of rules do we really want?
– Confidence(X → Y) should be sufficiently high
  • To ensure that people who buy X will more likely buy Y than not buy Y

– Confidence(X → Y) > support(Y)
  • Otherwise, the rule will be misleading, because having item X actually reduces the
    chance of having item Y in the same transaction
• Is there any measure that captures this constraint?
– Answer: Yes. There are many of them.

Data Mining 78
Statistical Independence
• Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)

– P(S∧B) = 420/1000 = 0.42

– P(S) × P(B) = 0.6 × 0.7 = 0.42

– P(S∧B) = P(S) × P(B) ⇒ Statistical independence
– P(S∧B) > P(S) × P(B) ⇒ Positively correlated
– P(S∧B) < P(S) × P(B) ⇒ Negatively correlated

Data Mining 79
Statistical-Based Measures for Interestingness
• Statistical-Based Measures use statistical dependence information.
• Two of them are Lift and Interest (they are equal).

Lift = P(Y|X) / P(Y)

Interest = P(X,Y) / ( P(X) P(Y) )

Lift(A,B) = conf(A→B) / support(B)
          = support(A ∪ B) / ( support(A) support(B) )
Interest(A,B) = support(A ∪ B) / ( support(A) support(B) )

Interest(A,B)  = 1  if A and B are independent
               > 1  if A and B are positively correlated
               < 1  if A and B are negatively correlated
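A quick numeric check of these formulas (the helper function is an illustrative addition; the numbers come from the contingency tables on the neighboring slides):

def lift(n_xy, n_x, n_y, n):
    # lift(X -> Y) = P(Y|X) / P(Y) = (n_xy / n_x) / (n_y / n)
    return (n_xy / n_x) / (n_y / n)

# Tea -> Coffee: 15 of the 20 tea buyers also buy coffee; 90 of 100 customers buy coffee
print(round(lift(15, 20, 90, 100), 4))          # 0.8333 -> negatively correlated

# Basketball -> Cereal: 2000 of 3000 players eat cereal; 3750 of 5000 students overall
print(round(lift(2000, 3000, 3750, 5000), 2))   # 0.89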

Data Mining 80
Example: Lift/Interest

          Coffee   ¬Coffee
Tea         15        5       20
¬Tea        75        5       80
            90       10      100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75 = support({Tea,Coffee}) / support({Tea})
but P(Coffee) = 0.9
⇒ Lift = 0.75 / 0.9 = 0.8333  (< 1, therefore Tea and Coffee are negatively correlated)

Data Mining 81
Example: Lift/Interest

• play basketball → eat cereal [40%, 66.7%] is misleading
  – The overall % of students eating cereal is 75% > 66.7%.
• play basketball → not eat cereal [20%, 33.3%] is more accurate, although
  with lower support and confidence

              Basketball   Not basketball   Sum (row)
Cereal           2000          1750           3750
Not cereal       1000           250           1250
Sum (col.)       3000          2000           5000

lift(B, C)  = (2000/5000) / ( (3000/5000) × (3750/5000) ) = 0.89
lift(B, ¬C) = (1000/5000) / ( (3000/5000) × (1250/5000) ) = 1.33

Data Mining 82
Limitations of Interest Factor
• We expect the words data and mining to appear together more frequently than the
words compiler and mining in a collection of computer science articles.

[Table: contingency tables for the word pairs {p,q} and {r,s}]


• The interest factor for {p,q} is 1.02 and for {r, s} is 4.08.
– Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is
the value when p and q are statistically independent.
– On the other hand, the interest factor for {r, s} is higher than {p, q} even though r and s seldom appear
together in the same document.
– Confidence is perhaps the better choice in this situation because it considers the association between p
and q (94.6%) to be much stronger than that between r and s (28.6%).

Data Mining 83
Different Measures
• There are lots of measures proposed in the literature

• Some measures are good for certain applications, but not for others

• What criteria should we use to determine whether a measure is good or bad?

Data Mining 84
Properties of A Good Measure
3 properties a good measure M must satisfy:

– M(A,B) = 0 if A and B are statistically independent

– M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain
unchanged

– M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or
P(A)] remain unchanged

Data Mining 85
