ASSOCIATION RULE MINING
Given a set of transactions, find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction
THE TASK
Two ways of defining the task
General
Input: A collection of instances
Output: rules to predict the values of any attribute(s) (not just the class attribute) from
values of other attributes
E.g. if temperature = cool then humidity = normal
If the right-hand side of a rule has only the class attribute, then the rule is a classification
rule
Distinction: Classification rules are intended to be applied together as a set of rules, whereas
association rules express independent regularities and each rule is used on its own
Specific: the market-basket model
• A large set of baskets, each of which is a small set of items, e.g., the
items one customer buys on one day
EXAMPLE
Market-Basket transactions
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence,
not causality!
{Diaper} → {Beer}: In a significant number of transactions where Diaper is present,
Beer is also present. But this does not mean that diapers cause people to buy beer.
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
(Counts refer to the five market-basket transactions above.)
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y: s(X → Y) = σ(X ∪ Y) / |T|
– Confidence (c): how often items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y) / σ(X)
– Example: for {Milk, Diaper} → {Beer}, s = 2/5 = 0.4 and c = 2/3 ≈ 0.67
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
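To make the definitions concrete, here is a minimal Python sketch (not from the slides; the function names are our own) that computes support and confidence over the five market-basket transactions above:

```python
# Toy market-basket database from the example slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(lhs -> rhs) = support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"Milk", "Bread", "Diaper"}, transactions))      # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...
```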
Mining Association Rules
Example of Rules (computed over the five transactions above):
{Milk, Diaper} → {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} → {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diaper, Beer} (s = 0.4, c = 0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence measures
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent Itemset Generation
[Figure: the lattice of all itemsets over {A, B, C, D, E}, from null up to ABCDE. Given d items, there are 2^d possible candidate itemsets.]

• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each of the N transactions (of average width w) against each of the M candidates
– Complexity ~ O(NMw): expensive, since M = 2^d
• The total number of possible rules over d items is R = 3^d − 2^(d+1) + 1 (602 rules for d = 6)
Frequent Itemset Generation Strategies
Three general strategies:
• Reduce the number of candidates (M): prune candidates using the Apriori principle
• Reduce the number of transactions (N): transaction reduction
• Reduce the number of comparisons (NM): store candidates in efficient data structures such as hash trees
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• Apriori principle holds due to the following property of the support measure:
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
(the support of an itemset never exceeds the support of its subsets: support is anti-monotone)
Illustrating Apriori Principle
[Figure: itemset lattice over {A, B, C, D, E}; once an itemset (e.g. AB) is found to be infrequent, all of its supersets are pruned]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

Triplets (3-itemsets):
Itemset | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
Apriori Algorithm
• Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are
frequent
THE APRIORI ALGORITHM: BASIC IDEA
Join Step: Ck is generated by joining Lk−1 with itself^
Prune Step: Any (k−1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
^The join of Lk with itself requires the two joining itemsets to share k−1 items.
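The pseudo-code above translates almost line for line into Python. The sketch below (our own naming; a simplified implementation, not the optimized algorithm) follows the same join / prune / count / eliminate structure and reproduces the worked example on the next slide:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, n in counts.items() if n >= min_support}
    frequent, k = set(Lk), 1
    while Lk:
        # Join step: unite pairs of frequent k-itemsets sharing k-1 items.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step: drop candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count step: one scan of the database per level.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        Lk = {c for c, n in counts.items() if n >= min_support}
        frequent |= Lk
        k += 1
    return frequent

# The example database below (MinSup = 2) yields {2, 3, 5} as its only 3-itemset:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(D, 2))))
```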
THE APRIORI ALGORITHM — EXAMPLE (MinSup = 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1 (candidate 1-itemsets with counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (generated from L2): {2 3 5}
Scan D → {2 3 5}: 2

L3 (frequent 3-itemsets): {2 3 5}: 2
Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of each candidate
itemset
– To reduce the number of comparisons, store the candidates in a hash structure
• Instead of matching each transaction against every candidate, match it
against candidates contained in the hashed buckets
[Figure: systematic, level-by-level enumeration of the C(5,3) = 10 3-subsets of transaction {1, 2, 3, 5, 6}, as performed when probing the hash tree with a transaction]
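A simplified sketch of the same idea, assuming a flat hash map in place of a real hash tree: each transaction is expanded into its k-subsets, and only the candidates actually hit by a subset are touched, instead of matching the transaction against every candidate.

```python
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Support counting by hashing: probe each k-subset of a transaction."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        # Transaction {1,2,3,5,6} has C(5,3) = 10 3-subsets to probe.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:   # only candidates in the hit bucket are updated
                counts[key] += 1
    return counts

cands = [(1, 2, 3), (1, 3, 5), (2, 3, 5), (3, 5, 6), (1, 5, 6)]
print(count_candidates([{1, 2, 3, 5, 6}], cands, 3))
```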
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with
number of transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase max length of frequent itemsets and traversals of hash tree
(the number of subsets in a transaction increases with its width)
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent

[Figure: itemset lattice over {A, B, C, D, E} with a border separating frequent from infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately inside the border]
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the
itemset
TID | Items
1 | {A, B}
2 | {B, C, D}
3 | {A, B, C, D}
4 | {A, B, D}
5 | {A, B, C, D}

Itemset | Support
{A} | 4
{B} | 5
{C} | 3
{D} | 4
{A, B} | 4
{A, C} | 2
{A, D} | 3
{B, C} | 3
{B, D} | 4
{C, D} | 3
{A, B, C} | 2
{A, B, D} | 3
{A, C, D} | 2
{B, C, D} | 3
{A, B, C, D} | 2
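A small Python sketch (our own; minsup = 3 is an assumed threshold for the maximal part, since the slide does not fix one) that checks both definitions against the five transactions above:

```python
from itertools import combinations

transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]
items = {"A", "B", "C", "D"}

# Support count of every non-empty itemset over {A, B, C, D}.
support = {frozenset(c): sum(set(c) <= t for t in transactions)
           for k in range(1, 5) for c in combinations(sorted(items), k)}

minsup = 3  # assumed threshold
frequent = {s for s, n in support.items() if n >= minsup}

# Maximal frequent: frequent, and no immediate superset is frequent.
maximal = {s for s in frequent
           if not any(s | {i} in frequent for i in items - s)}
# Closed: no immediate superset has the same support.
closed = {s for s in support
          if not any(support[s | {i}] == support[s] for i in items - s)}

print(sorted(map(sorted, maximal)))  # [['A', 'B', 'D'], ['B', 'C', 'D']]
print(sorted(map(sorted, closed)))
```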
Maximal vs Closed Itemsets
TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

[Figure: the itemset lattice annotated with the IDs of the transactions supporting each itemset; itemsets supported by no transaction, such as ABCDE, are marked]
Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same annotated lattice, with each frequent itemset marked as either closed but not maximal, or closed and maximal]

# Closed = 9
# Maximal = 4
Compact Representation of Frequent Itemsets
• Some itemsets are redundant because they have identical support as their
supersets
TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)
⇒ a compact representation is needed
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– General-to-specific vs Specific-to-general
[Figure: position of the frequent itemset border in the lattice, between null and {a1, a2, ..., an}, under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional search]
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– Equivalence Classes (two itemsets belong to the same class if they share the same prefix or the same suffix)

[Figure: the lattice over {A, B, C, D} partitioned into equivalence classes by common prefix (left) and by common suffix (right)]
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– Breadth-first vs Depth-first
– Apriori traverses in BFS manner
– DFS can quickly find maximal frequent itemsets
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
ECLAT: for each item, store a list of transaction ids (tids); vertical data layout
Horizontal Data Layout:
TID | Items
1 | A, B, E
2 | B, C, D
3 | C, E
4 | A, C, D
5 | A, B, C, D
6 | A, E
7 | A, B
8 | A, B, C
9 | A, C, D
10 | B

Vertical Data Layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1)
subsets.
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B = tid-list of AB: 1, 5, 7, 8 ⇒ support(AB) = 4
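With Python sets standing in for the tid-lists, the intersection is a one-liner (a sketch, not library code):

```python
# Tid-lists from the vertical layout above.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}

ab = tidlists["A"] & tidlists["B"]        # tid-list of the 2-itemset {A, B}
print(sorted(ab), "support =", len(ab))   # [1, 5, 7, 8] support = 4
```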
Transaction Reduction
• Prune irrelevant transactions early
• If a transaction does not contain any frequent k-itemsets, it cannot help in
generating any frequent (k+1)-itemsets.
• This pruning is done after each iteration:
– After processing frequent k-itemsets, remove transactions that contain no
frequent k-itemsets
– Repeat the process for (k+1), (k+2), etc.
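A minimal sketch of this pruning step (the function name is our own):

```python
from itertools import combinations

def reduce_transactions(transactions, frequent_k, k):
    """Keep only transactions that contain at least one frequent k-itemset;
    the rest cannot contribute to any frequent (k+1)-itemset."""
    return [t for t in transactions
            if any(frozenset(c) in frequent_k
                   for c in combinations(sorted(t), k))]
```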
FP GROWTH ALGORITHM
Apriori: uses a generate-and-test approach – generates candidate itemsets
and tests if they are frequent
Generation of candidate itemsets is expensive (in both space and time)
Support counting is expensive
Subset checking (computationally expensive)
Mining Frequent Patterns Without Candidate Generation
STEP 1: FP-TREE CONSTRUCTION
Pass 1:
Scan the data once to determine the support of each item; discard infrequent items and sort the frequent items in decreasing support order
Pass 2:
Nodes correspond to items and have a counter
FP-Growth reads one transaction at a time and maps it to a path
A fixed item order is used, so paths can overlap when transactions share items
(i.e. when they have the same prefix); in this case, the counters along the shared prefix are incremented
Pointers are maintained between nodes containing the same item, creating singly linked lists (the node-links)
The more paths overlap, the higher the compression; the FP-tree may then fit in memory
FP-Growth Method : An Example
Consider the example of a database D consisting of 9 transactions:

TID | List of Items
T1 | I1, I2, I5
T2 | I2, I4
T3 | I2, I3
T4 | I1, I2, I4
T5 | I1, I3
T6 | I2, I3
T7 | I1, I3
T8 | I1, I2, I3, I5
T9 | I1, I2, I3

• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%)
• The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts
• The set of frequent items is sorted in descending order of support count
• The resulting set is denoted L = {I2:7, I1:6, I3:6, I4:2, I5:2}
Support Count and Sorted Items
Items | Support
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2

TID | Items | Sorted Items
T1 | I1, I2, I5 | I2, I1, I5
T2 | I2, I4 | I2, I4
T3 | I2, I3 | I2, I3
T4 | I1, I2, I4 | I2, I1, I4
T5 | I1, I3 | I1, I3
T6 | I2, I3 | I2, I3
T7 | I1, I3 | I1, I3
T8 | I1, I2, I3, I5 | I2, I1, I3, I5
T9 | I1, I2, I3 | I2, I1, I3
FP-Growth Method: Construction of FP-Tree
• First, create the root of the tree, labeled with “null”.
• Scan the database D a second time (First time we scanned it to create 1-itemset
and then L), this will generate the complete tree.
• The items in each transaction are processed in L order (i.e. sorted order).
• A branch is created for each transaction; along a branch, each node is labeled with an item and its support count, separated by a colon.
• Whenever the same node is encountered in another transaction, we just
increment the support count of the common node or Prefix.
• To facilitate tree traversal, an item header table is built so that each item points to
its occurrences in the tree via a chain of node-links.
• The problem of mining frequent patterns in the database is thus transformed into mining the FP-Tree.
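A compact sketch of the second pass (class and function names are our own), assuming the transactions have already been sorted into L order as in the "Sorted Items" column above:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> FPNode

def build_fp_tree(sorted_transactions):
    root = FPNode(None, None)           # the "null" root
    header = {}                         # item -> list of its nodes (node-links)
    for t in sorted_transactions:
        node = root
        for item in t:                  # shared prefixes reuse existing nodes
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1             # increment the counter along the path
    return root, header

sorted_db = [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
             ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
             ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"]]
root, header = build_fp_tree(sorted_db)
print(root.children["I2"].count)        # 7, matching the I2:7 node
```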
FP-Growth Method: Construction of FP-Tree
Item Id | Sup Count
I2 | 7
I1 | 6
I3 | 6
I4 | 2
I5 | 2
(each header entry points, via its node-link chain, to the occurrences of that item in the tree)

null {}
├─ I2:7
│   ├─ I1:4
│   │   ├─ I5:1
│   │   ├─ I4:1
│   │   └─ I3:2
│   │       └─ I5:1
│   ├─ I4:1
│   └─ I3:2
└─ I1:2
    └─ I3:2
An FP-Tree that registers compressed, frequent pattern information
Mining the FP-Tree by Creating Conditional (sub) pattern
bases
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in
the FP-Tree that co-occur with the suffix pattern.
3. Then, construct its conditional FP-Tree & perform mining on this tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the
frequent patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required
frequent itemset.
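Building on the FPNode sketch above, step 2 can be sketched as follows: follow the node-links for the suffix item and climb each node's parent pointers to collect the prefix paths.

```python
def conditional_pattern_base(item, header):
    """Prefix paths that co-occur with `item`, each with the node's count."""
    base = []
    for node in header[item]:           # walk the node-link chain for `item`
        path, parent = [], node.parent
        while parent.item is not None:  # climb towards the null root
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base("I5", header))
# [(['I2', 'I1'], 1), (['I2', 'I1', 'I3'], 1)], matching the table below
```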
FP-Tree Example Continued
Item | Conditional pattern base | Conditional FP-Tree | Frequent patterns generated
I5 | {(I2 I1: 1), (I2 I1 I3: 1)} | ⟨I2:2, I1:2⟩ | I2 I5:2, I1 I5:2, I2 I1 I5:2
I3 | {(I2 I1: 2), (I2: 2), (I1: 2)} | ⟨I2:4, I1:2⟩, ⟨I1:2⟩ | I2 I3:4, I1 I3:4, I2 I1 I3:2

(For I3, the pattern I1 I3 has support 4: count 2 from the I2 I1 paths plus count 2 from the standalone I1 path.)
Why Is Frequent Pattern Growth Fast?
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scans
– Basic operation is counting and FP-tree building
FP-TREE SIZE
The FP-Tree usually has a smaller size than the uncompressed data, since many transactions typically share items (and hence prefixes).
• Best case scenario: all transactions contain the same set of items
– 1 path in the FP-tree
• Worst case scenario: every transaction has a unique set of items (no items in common)
– The size of the FP-tree is then at least as large as the original data
– Storage requirements for the FP-tree are higher, since it must also store the pointers between nodes and the counters
The size of the FP-tree also depends on how the items are ordered.
• Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic)
ADVANTAGES AND DISADVANTAGES
Advantages of FP-Growth
only 2 passes over data-set
“compresses” data-set
no candidate generation
much faster than Apriori
Disadvantages of FP-Growth
FP-Tree may not fit in memory!!
FP-Tree is expensive to build
RULE GENERATION
Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f
satisfies the minimum confidence requirement
If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
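A minimal sketch (our own naming) that enumerates all 2^k − 2 candidate rules of one frequent itemset and keeps those meeting minconf; note that only precomputed support counts are needed, no further database scans:

```python
from itertools import combinations

def rules_from_itemset(L, support, minconf):
    """`support` maps frozensets to support counts; returns (lhs, rhs, conf)."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                # every non-empty proper subset f
        for f in map(frozenset, combinations(sorted(L), r)):
            conf = support[L] / support[f]    # c(f -> L - f)
            if conf >= minconf:
                rules.append((set(f), set(L - f), conf))
    return rules
```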
RULE GENERATION
How to efficiently generate rules from frequent itemsets?
In general, confidence does not have an anti-monotone property
c(ABC → D) can be larger or smaller than c(AB → D)
But the confidence of rules generated from the same itemset does have an anti-monotone property
e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule
Candidate rules are therefore generated by merging two rules that share the same prefix in the rule consequent:
join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
COMPUTING INTERESTINGNESS MEASURE
Given a rule X → Y, the information needed to compute rule interestingness can be
obtained from a contingency table.

Contingency table for X → Y:

      |  Y  | ¬Y  |
X     | f11 | f10 | f1+
¬X    | f01 | f00 | f0+
      | f+1 | f+0 | |T|

f11: support count of X and Y
f10: support count of X and ¬Y
f01: support count of ¬X and Y
f00: support count of ¬X and ¬Y

The counts are used to define various measures:
support, confidence, lift, Gini index, J-measure, etc.
DRAWBACK OF CONFIDENCE
       | Coffee | ¬Coffee |
Tea    |   15   |    5    |  20
¬Tea   |   75   |    5    |  80
       |   90   |   10    | 100

Consider the rule Tea → Coffee:
• confidence = P(Coffee | Tea) = 15/20 = 0.75, which looks high
• but P(Coffee) = 0.9, and P(Coffee | ¬Tea) = 75/80 = 0.9375
• although confidence is high, the rule is misleading: tea drinkers are actually less likely to buy coffee than non-tea drinkers

Lift compensates for this by normalizing confidence by the support of the consequent:

Lift = confidence(X → Y) / support(Y)
EXAMPLE: LIFT/INTEREST
Using the same tea/coffee table:
• confidence(Tea → Coffee) = P(Coffee | Tea) = 15/20 = 0.75
• support(Coffee) = 90/100 = 0.9
• Lift = 0.75 / 0.9 ≈ 0.833
Since Lift < 1, Tea and Coffee are negatively associated, despite the high confidence.
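The same numbers worked in code (a trivial sketch; variable names follow the contingency-table notation):

```python
f11, f10, f01, f00 = 15, 5, 75, 5             # tea/coffee contingency counts
n = f11 + f10 + f01 + f00                     # |T| = 100

confidence = f11 / (f11 + f10)                # P(Coffee | Tea) = 0.75
support_y = (f11 + f01) / n                   # P(Coffee) = 0.9
lift = confidence / support_y

print(confidence, support_y, round(lift, 4))  # 0.75 0.9 0.8333
```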