Data Mining 1
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 2
Frequent Pattern
• Frequent Pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set.
• For example, a set of items, such as milk and bread, that appear frequently together in
a transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern.
• A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.
Data Mining 3
Frequent Pattern
Market Basket Analysis
• Frequent Pattern: a pattern that occurs frequently in a data set.
– A set of items that appears frequently together in a transaction data set is called a
frequent itemset.
• An example of frequent itemset mining is market basket analysis.
– This process analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.
– If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item.
– Each basket can then be represented by a Boolean vector of values assigned to these
variables.
– The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together.
– These patterns can be represented in the form of association rules.
Data Mining 4
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
 – k-itemset X = {x1, …, xk}
• (absolute) support of X: frequency (count) of an itemset X
 – Absolute support of {Beer} is 3
• (relative) support of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 – Relative support of {Beer} is 3/5
• An itemset X is frequent if X’s support is no less than a minsup threshold.

[Figure: diagram of customers who buy beer, customers who buy diapers, and customers who buy both.]
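As a small illustration of these definitions, here is a minimal Python sketch (the helper names are made up for this example) that computes absolute and relative support over the five transactions above:

```python
# Five example transactions from the table above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def absolute_support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def relative_support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return absolute_support(itemset, transactions) / len(transactions)

print(absolute_support({"Beer"}, transactions))            # 3
print(relative_support({"Beer"}, transactions))            # 0.6
print(relative_support({"Beer", "Diaper"}, transactions))  # 0.6
```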
Data Mining 5
Basic Concepts: Association Rules
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
Data Mining 6
Basic Concepts: Association Rules
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Rule evaluation metrics:
 • Support s(X → Y) = σ(X ∪ Y) / N: the fraction of transactions that contain both X and Y
 • Confidence c(X → Y) = σ(X ∪ Y) / σ(X): how often items in Y appear in transactions that contain X
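Worked example, using the transaction table from the “Basic Concepts: Frequent Patterns” slide: for the rule {Diaper} → {Beer}, σ({Beer, Diaper}) = 3 and σ({Diaper}) = 4, so its support is 3/5 = 0.6 and its confidence is 3/4 = 0.75.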
Data Mining 7
Why Use Support and Confidence?
• Support is an important measure because a rule that has very low support may occur
simply by chance.
 – A low-support rule may also be uninteresting from a business perspective because it may not be
profitable to promote items that customers seldom buy together
 – For these reasons, support is often used to eliminate uninteresting rules
• Confidence measures the reliability of the inference made by a rule: for a rule X → Y, the higher
the confidence, the more likely it is for Y to be present in transactions that contain X
Data Mining 8
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules
having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Data Mining 9
Mining Association Rules
Observations:
• Rules generated from the same itemset {Milk, Diaper, Beer}, e.g., {Milk, Diaper} → {Beer},
{Milk, Beer} → {Diaper}, {Diaper, Beer} → {Milk}, {Beer} → {Milk, Diaper}, …, are all binary
partitions of that itemset
• Rules originating from the same itemset have identical support but can have different
confidence
• Thus, we may decouple the support and confidence requirements
Data Mining 10
Association Rule Mining
• The problem of mining association rules can therefore be reduced to a two-step process:
 – Frequent itemset generation: find all itemsets whose support ≥ minsup
 – Rule generation: from each frequent itemset, generate the high-confidence rules
   (confidence ≥ minconf), each of which is a binary partitioning of the itemset
Data Mining 11
Association Rules - Example
Data Mining 12
Association Rules - Example
Data Mining 13
Frequent Itemset Generation
An itemset lattice
Data Mining 14
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
Transactions (N) matched against the list of candidates (M):

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

– Each of the N transactions is matched against every one of the M candidate itemsets; with
maximum transaction width w, this takes on the order of N × M × w comparisons.

Total number of possible association rules that can be extracted from a data set with d items:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
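For example, with d = 6 items the formula gives R = 3^6 − 2^7 + 1 = 729 − 128 + 1 = 602 possible rules.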
Data Mining 16
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
– The Apriori principle is an effective way to eliminate some of the candidate
itemsets without counting their support values.
Data Mining 17
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 18
Reducing Number of Candidates
Apriori Principle
• Apriori Principle: If an itemset is frequent, then all of its subsets
must also be frequent.
• Apriori principle holds due to the following property of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
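As a quick check against the transaction table on the “Basic Concepts: Frequent Patterns” slide: s({Beer, Diaper}) = 3/5, while its subsets satisfy s({Beer}) = 3/5 and s({Diaper}) = 4/5, both at least 3/5, exactly as the anti-monotone property requires.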
Data Mining 19
Illustrating Apriori Principle
[Figure: itemset lattice over the items {A, B, C, D, E}, from the null set down to ABCDE. Once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the search space.]
Data Mining 20
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1
Data Mining 21
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Minimum Support = 3
Eliminate infrequent 1-itemset candidates

If every subset is considered:
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
  6 + 6 + 4 = 16 candidates
Data Mining 22
Illustrating Apriori Principle
Minimum Support = 3
Generate 2-itemset candidates
If every subset is considered:
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
  6 + 6 + 4 = 16 candidates
Data Mining 23
Illustrating Apriori Principle
Data Mining 24
Illustrating Apriori Principle
Data Mining 27
Apriori Algorithm:
Candidate Generation: F(k−1) × F(k−1) Method
• Merge two frequent (k-1)-itemsets if their first (k-2) items are identical
• F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
Data Mining 28
Apriori Algorithm:
Candidate Pruning
• Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of frequent 3-itemsets
• Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
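The following minimal Python sketch implements the F(k−1) × F(k−1) merge and the subset-based pruning described on this and the previous slide (the function names are illustrative, not from any library), reproducing the ABCD / ABCE / ABDE example:

```python
from itertools import combinations

def generate_candidates(freq_prev):
    """F(k-1) x F(k-1) merge: join two frequent (k-1)-itemsets whose
    first k-2 items (in sorted order) are identical."""
    freq_prev = [tuple(sorted(x)) for x in freq_prev]
    candidates = set()
    for a, b in combinations(sorted(freq_prev), 2):
        if a[:-1] == b[:-1]:                       # first k-2 items identical
            candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

def prune(candidates, freq_prev):
    """Apriori pruning: drop a candidate if any of its (k-1)-subsets is not frequent."""
    freq_prev = {tuple(sorted(x)) for x in freq_prev}
    return {c for c in candidates
            if all(s in freq_prev for s in combinations(c, len(c) - 1))}

F3 = [("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")]
C4 = generate_candidates(F3)   # {ABCD, ABCE, ABDE}
print(prune(C4, F3))           # {('A', 'B', 'C', 'D')}: ABCE and ABDE are pruned
```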
Data Mining 29
Apriori Algorithm:
Support Counting of Candidate Itemsets
• Scan the database of transactions to determine the support of each candidate itemset
– Must match every candidate itemset against every transaction, which is an
expensive operation
Data Mining 30
Apriori Algorithm
Data Mining 31
Apriori Algorithm
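As a stand-in for the full Apriori pseudocode, here is a minimal, unoptimized Python sketch of the level-wise loop (it uses a generic join of frequent (k−1)-itemsets plus Apriori pruning rather than the exact F(k−1) × F(k−1) merge, and brute-force support counting):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset mining; returns {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items
                if support_count(frozenset([i])) >= minsup_count]
    result = {fs: support_count(fs) for fs in frequent}
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: unions of two frequent (k-1)-itemsets of size k ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... followed by Apriori pruning of candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting pass over the database.
        frequent = [c for c in candidates if support_count(c) >= minsup_count]
        result.update({c: support_count(c) for c in frequent})
        k += 1
    return result

# Example: the 5-transaction data set used on the earlier slides, minsup count = 3.
T = [{"Bread", "Milk"}, {"Beer", "Bread", "Diaper", "Eggs"},
     {"Beer", "Coke", "Diaper", "Milk"}, {"Beer", "Bread", "Diaper", "Milk"},
     {"Bread", "Coke", "Diaper", "Milk"}]
print(apriori(T, 3))
```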
Data Mining 32
Apriori Algorithm - An Example
Data Mining 33
Support Counting Using Hash Tree
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates
– Must match every candidate itemset against every transaction, which is an
expensive operation
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
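The hash tree itself takes a bit of code; as a simplified stand-in, the sketch below shows what the subset function has to compute, by enumerating the 3-item subsets of a transaction and matching them against the candidate set with plain set lookups (the 15 candidates and the transaction {1 2 3 5 6} are the ones used in the example on the following slides):

```python
from itertools import combinations

def count_in_transaction(transaction, candidates, counts, k=3):
    """Increment the count of every candidate k-itemset contained in `transaction`.
    A hash tree avoids matching all subsets against all candidates; this
    brute-force version only illustrates what must be computed."""
    for subset in combinations(sorted(transaction), k):
        if subset in candidates:
            counts[subset] = counts.get(subset, 0) + 1

candidates = {(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)}
counts = {}
count_in_transaction({1, 2, 3, 5, 6}, candidates, counts)
print(counts)   # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}
```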
Data Mining 34
Support Counting Using Hash Tree
Subset Operation
• Enumerating subsets of three items from a transaction t
Data Mining 35
Support Counting Using Hash Tree
Generate Candidate Hash Tree
• Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7},
{3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
Data Mining 36
Support Counting Using Hash Tree
Generate Candidate Hash Tree
[Figure: candidate hash tree for the 15 itemsets above. The hash function sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch at every level; each leaf node stores the candidate 3-itemsets that hash to it.]
Data Mining 37
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts
[Figure: matching the transaction {1 2 3 5 6} against the candidate hash tree. At the root level, the possible first items of a 3-subset are enumerated, splitting the transaction into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}; each prefix is hashed (1 → left, 2 → middle, 3 → right branch) to choose which subtree to follow.]
Data Mining 38
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts
[Figure: continuing the traversal one level down, the prefix 1 + {2 3 5 6} is expanded into 1 2 + {3 5 6}, 1 3 + {5 6}, and 1 5 + {6}, and each is hashed on its second item to select the next branch of the hash tree.]
Data Mining 39
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts
[Figure: the complete traversal of the candidate hash tree for the transaction {1 2 3 5 6}; only the candidates in the leaves reached by the traversal have to be compared against the transaction.]
Match transaction against 9 out of 15 candidates
Data Mining 40
Factors Affecting Complexity of Apriori Algorithm
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may also
increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with number of
transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets; in addition, the number of subsets
contained in a transaction grows with its width, making support counting more expensive
Data Mining 41
Effect of Support Threshold
• Effect of support threshold on the number of candidate and frequent itemsets
Data Mining 42
Effect of Average Transaction Width
• Effect of average transaction width on the number of candidate and frequent itemsets
Data Mining 43
Effect of Support Distribution
• How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets involving interesting rare items
(e.g., expensive products)
– If minsup is set too low, mining becomes computationally expensive and the number of
frequent itemsets becomes very large
Data Mining 44
Multiple Minimum Support
• How to apply multiple minimum supports?
– MS(i): minimum support for item i
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%
Data Mining 45
Multiple Minimum Support
• Order the items according to their minimum support (in ascending order)
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– Ordering: Broccoli, Salmon, Coke, Milk
Data Mining 46
Multiple Minimum Support
• Modifications to Apriori:
– In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two
frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subsets of size k
Data Mining 47
Rule Generation in Apriori Algorithm
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the candidate
rule f → L − f satisfies the minimum confidence requirement
 – If {A,B,C,D} is a frequent itemset, the candidate rules are:
   ABC → D,  ABD → C,  ACD → B,  BCD → A,
   D → ABC,  C → ABD,  B → ACD,  A → BCD,
   AB → CD,  AC → BD,  AD → BC,
   CD → AB,  BD → AC,  BC → AD
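For a frequent k-itemset there are 2^k − 2 candidate rules (ignoring the rules with an empty antecedent or consequent), so {A,B,C,D} yields the 14 rules above. A minimal Python sketch of the enumeration, where the support dictionary is a hypothetical input assumed to come from the frequent itemset generation step:

```python
from itertools import combinations

def candidate_rules(frequent_itemset, support, minconf):
    """Generate rules f -> (L - f) for every non-empty proper subset f of L
    and keep those whose confidence meets minconf.
    `support` maps frozensets to support counts."""
    L = frozenset(frequent_itemset)
    rules = []
    for r in range(1, len(L)):
        for antecedent in combinations(sorted(L), r):
            f = frozenset(antecedent)
            confidence = support[L] / support[f]
            if confidence >= minconf:
                rules.append((set(f), set(L - f), confidence))
    return rules
```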
Data Mining 48
Rule Generation in Apriori Algorithm
• How to efficiently generate rules from frequent itemsets?
 – In general, confidence does not have an anti-monotone property: c(ABC → D) can be
larger or smaller than c(AB → D)
• But the confidence of rules generated from the same itemset does have an anti-monotone
property
 – e.g., suppose {A,B,C,D} is a frequent 4-itemset:
   c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
 – Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule
Data Mining 49
Rule Generation in Apriori Algorithm
Lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD ⇒ {} at the top
down through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D, and so on.
[Figure: when a rule in the lattice is found to have low confidence, all rules below it (obtained by moving further items from the antecedent to the consequent) can be pruned.]
Data Mining 51
Compact Representation of Frequent Itemsets
• The number of frequent itemsets produced from a transaction data set can be very
large.
• Some produced itemsets can be redundant because they have the same support as their
supersets
• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived. Need a compact representation
– Maximal Frequent Itemsets and
– Closed Frequent Itemsets
Data Mining 52
Maximal Frequent Itemsets
Maximal Frequent Itemset: A maximal frequent itemset is defined as a frequent
itemset for which none of its immediate supersets are frequent.
Data Mining 53
Maximal Frequent Itemsets
All frequent itemsets can be derived from the maximal frequent itemsets
(in the lattice example: ad, ace, and bcde).
Every frequent itemset is a subset of some maximal frequent itemset.
Data Mining 54
Maximal Frequent Itemsets
• Despite providing a compact representation, maximal frequent itemsets do not contain
the support information of their subsets.
• For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and
{b,c,d,e} do not provide any hint about the supports of their subsets.
• An additional pass over the data set is therefore needed to determine the support
counts of the non-maximal frequent itemsets.
Data Mining 55
Closed Frequent Itemsets
Closed Itemset: An itemset X is closed if none of its immediate supersets has exactly
the same support count as X.
Data Mining 56
Closed Frequent Itemsets
All subsets of a closed frequent itemset are frequent, and their supports are greater
than or equal to the support of that closed frequent itemset.
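Given all frequent itemsets together with their support counts, the following brute-force Python sketch (illustrative only) flags which of them are closed and which are maximal by inspecting immediate supersets:

```python
def closed_and_maximal(freq_support):
    """freq_support maps each frequent itemset (frozenset) to its support count.
    Returns the sets of closed and maximal frequent itemsets."""
    items = set().union(*freq_support)   # all items appearing in frequent itemsets
    closed, maximal = set(), set()
    for X, s in freq_support.items():
        frequent_supersets = [X | {i} for i in items - X if (X | {i}) in freq_support]
        # Closed: no immediate superset has exactly the same support count.
        # (An infrequent superset cannot tie the support of a frequent X.)
        if all(freq_support[Y] != s for Y in frequent_supersets):
            closed.add(X)
        # Maximal: none of the immediate supersets is frequent at all.
        if not frequent_supersets:
            maximal.add(X)
    return closed, maximal
```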
Data Mining 57
Maximal vs Closed Itemsets
# Closed = 9
# Maximal = 4
Data Mining 58
Maximal vs Closed Itemsets
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
[Figure: nested sets showing that every maximal frequent itemset is closed, and every closed frequent itemset is frequent.]
Data Mining 59
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 60
FP-Growth (Frequent Pattern Growth) Algorithm
• The FP-growth algorithm takes a radically different approach to discovering
frequent itemsets.
– The algorithm does not subscribe to the generate-and-test paradigm of Apriori
• FP-growth algorithm encodes the data set using a compact data structure called an
FP-tree and extracts frequent itemsets directly from this structure.
– Use a compressed representation of the database using an FP-tree
– Once an FP-tree has been constructed, it uses a recursive divide-and-conquer
approach to mine the frequent itemsets
Data Mining 61
FP-Tree Construction
• An FP-tree is a compressed representation of the input data.
• It is constructed by reading the data set one transaction at a time and mapping each
transaction onto a path in the FP-tree.
– Because different transactions can have several items in common, their paths may overlap.
– The more the paths overlap with one another, the more compression we can
achieve using the FP-tree structure.
Data Mining 62
FP-Tree Construction
• Each node in the tree contains the label of an item along with a counter that shows the
number of transactions mapped onto the given path.
– Initially, the FP-tree contains only the root node represented by the null symbol.
– Every transaction maps onto one of the paths in the FP-tree.
• The size of an FP-tree is typically smaller than the size of the uncompressed data
because many transactions in market basket data often share a few items in common.
 – Best-case scenario: all transactions have the same set of items, and the
   FP-tree contains only a single path of nodes.
 – Worst-case scenario: every transaction has a unique set of items, and the
   FP-tree is effectively as large as the original data.
 – In addition, the physical storage requirement of the FP-tree is higher because it needs extra
   space to store pointers between nodes and counters for each item.
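A bare-bones Python sketch of the node structure and of inserting one transaction (items already sorted by decreasing support); the class and attribute names are illustrative, and the header table is kept as simple per-item node lists rather than linked pointers:

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item         # item label (None for the root)
        self.count = 0           # number of transactions mapped through this node
        self.parent = parent
        self.children = {}       # item -> FPNode

def insert_transaction(root, transaction, header):
    """Insert one support-ordered transaction into the FP-tree."""
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            header.setdefault(item, []).append(child)   # node-link for this item
        child.count += 1
        node = child

# The first three transactions of the example on the next slide:
root, header = FPNode(None), {}
for t in [["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"]]:
    insert_transaction(root, t, header)
print(root.children["A"].count)   # 2, as in "After reading TID=3"
```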
Data Mining 63
FP-Tree Construction
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

After reading TID=1:  null – A:1 – B:1
After reading TID=2:  a second branch null – B:1 – C:1 – D:1 is added
                      (the tree now holds the paths A:1 – B:1 and B:1 – C:1 – D:1 under the root)
After reading TID=3:  transaction {A,C,D,E} shares only item A with the existing paths, so the
                      count of node A becomes 2 and a new path A:2 – C:1 – D:1 – E:1 is created
Data Mining 66
Finding Frequent Itemsets Ending with e
1. The first step is to gather all the paths containing node e. These initial paths are called prefix
paths
2. From the prefix paths, the support count for e is obtained by adding the support counts
associated with node e. Assuming that the minimum support count is 2, {e} is declared a
frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent
itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert
the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except
it is used to find frequent itemsets ending with a particular suffix.
– First, the support counts along the prefix paths must be updated because some of the counts include
transactions that do not contain item e.
– The prefix paths are truncated by removing the nodes for e.
– After updating the support counts along the prefix paths, some of the items may no longer be frequent
• the node b appears only once and has a support count equal to 1, which means that there is only one transaction
that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in
be must be infrequent.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent
itemsets ending in de, ce, and ae.
Data Mining 67
Prefix Paths Ending with e
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

FP-tree (the header table links, for each item A, B, C, D, E, all tree nodes labeled with that item):
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Prefix paths ending with e (the paths of the FP-tree that contain node e):
null
  A:7
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      E:1
Data Mining 68
Conditional FP-Tree for e
(Transaction table as on the previous slide; minsup = 2.)

Prefix paths ending with e:
null
  A:7
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      E:1

To create the conditional FP-tree for e:
• Update the support counts along the prefix paths, because some of the counts include
  transactions that do not contain e
• e is frequent (support = 3); remove the e nodes from the prefix paths
• Remove nodes of items that are no longer frequent (b now has a count of 1)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1
Data Mining 69
Conditional FP-Tree for de
(Transaction table as before; minsup = 2.)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1

Prefix paths ending with de (within the conditional FP-tree for e):
null
  A:2
    C:1
      D:1
    D:1

de is frequent (support = 2)

Conditional FP-tree for de (d nodes removed; c is no longer frequent and is dropped):
null
  A:2
Data Mining 70
Conditional FP-Tree for ce
(Transaction table as before; minsup = 2.)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1

Prefix paths ending with ce:
null
  A:2
    C:1
  C:1

ce is frequent (support = 2)

Conditional FP-tree for ce (c nodes removed, counts updated):
null
  A:1
Data Mining 71
Conditional FP-Tree for ae
(Transaction table as before; minsup = 2.)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1

Prefix paths ending with ae:
null
  A:2

ae is frequent (support = 2)
Data Mining 72
Frequent Itemsets Ordered by Suffixes
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

minsup = 2

Suffix   Frequent Itemsets
E        {E}, {D,E}, {A,D,E}, {C,E}, {A,E}
D        {D}, {C,D}, {B,C,D}, {A,C,D}, {B,D}, {A,B,D}, {A,D}
C        {C}, {B,C}, {A,B,C}, {A,C}
B        {B}, {A,B}
A        {A}

(FP-tree and header table as shown on the earlier “Prefix Paths Ending with e” slide.)
Data Mining 73
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 74
Evaluation of Association Patterns
• Association rule algorithms tend to produce too many rules
 – many of them are uninteresting or redundant
 – e.g., {A,B} → {D} is redundant if {A,B,C} → {D} has the same support & confidence
 – more generally, an association rule X → Y is redundant if there exists another rule X' → Y',
where X is a subset of X' and Y is a subset of Y', such that the support and
confidence for both rules are identical.
• In the original formulation of association rules, support & confidence are the only
measures used
Data Mining 75
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be
obtained from a contingency table
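For a rule X → Y over N transactions, the contingency table has the standard 2×2 form below, where f11 counts transactions containing both X and Y, f10 those containing X but not Y, f01 those containing Y but not X, and f00 those containing neither:

          Y      ~Y
X        f11    f10    f1+
~X       f01    f00    f0+
         f+1    f+0     N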
Data Mining 76
Drawback of Confidence
          Coffee   ~Coffee   Total
Tea         15        5        20
~Tea        75        5        80
Total       90       10       100
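Reading the numbers off the table: confidence(Tea → Coffee) = 15/20 = 0.75, which looks high, yet P(Coffee) = 90/100 = 0.9. Knowing that a person drinks tea actually lowers the probability that they drink coffee (from 0.9 to 0.75), so a high-confidence rule can still be misleading because confidence ignores the support of the rule's consequent.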
Data Mining 77
Measure for Association Rules
• So, what kind of rules do we really want?
– Confidence(X → Y) should be sufficiently high
 • to ensure that people who buy X are more likely to buy Y than not to buy Y
Data Mining 78
Statistical Independence
• Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)
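Completing the arithmetic: P(S ∧ B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42, so swimming and biking are statistically independent in this population; a joint probability above 0.42 would indicate a positive correlation, and one below 0.42 a negative correlation.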
Data Mining 79
Statistical-Based Measures for Interestingness
• Statistical-Based Measures use statistical dependence information.
• Two of them are Lift and Interest (they are equal).
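For a rule X → Y these measures are defined as Lift(X → Y) = c(X → Y) / s(Y) = s(X ∪ Y) / (s(X) × s(Y)), and Interest(X, Y) = P(X, Y) / (P(X) × P(Y)), which is the same quantity. A value of 1 indicates statistical independence, values above 1 a positive association, and values below 1 a negative association.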
Data Mining 80
Example: Lift/Interest
          Coffee   ~Coffee   Total
Tea         15        5        20
~Tea        75        5        80
Total       90       10       100
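Applying the definition to this table: Lift(Tea → Coffee) = 0.75 / 0.9 ≈ 0.83 < 1, so Tea and Coffee are negatively associated even though the confidence of the rule (0.75) looks high.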
Data Mining 81
Example: Lift/Interest
For a second data set of 5000 transactions containing items B and C:

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ~C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Data Mining 82
Limitations of Interest Factor
• We expect the words data and mining to appear together more frequently than the
words compiler and mining in a collection of computer science articles.
Data Mining 83
Different Measures
• There are lots of measures proposed in the literature
Data Mining 84
Properties of A Good Measure
3 properties a good measure M must satisfy:
 – M(A,B) = 0 if A and B are statistically independent
 – M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain
unchanged
 – M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or
P(A)] remain unchanged
Data Mining 85