
Association Rule Mining

CS2202

1
ASSOCIATION RULE MINING

 Given a set of transactions, find rules that will predict the occurrence of an item
based on the occurrences of other items in the transaction

Reference: “Introduction to Data Mining“ by Tan, Steinbach, Karpatne, Kumar

2
THE TASK
 Two ways of defining the task
 General
 Input: A collection of instances
 Output: rules to predict the values of any attribute(s) (not just the class attribute) from
values of other attributes
 E.g. if temperature = cool then humidity = normal
 If the right hand side of a rule has only the class attribute, then the rule is a classification
rule
 Distinction: Classification rules are applied together as sets of rules

 Specific - Market-basket analysis


 Input: a collection of transactions
 Output: rules to predict the occurrence of any item(s) from the occurrence of other items
in a transaction
 E.g. {Milk, Diaper} -> {Beer}
 General rule structure:
 Antecedents -> Consequents
3
The Market-Basket Model

• A large set of items, e.g., things sold in a supermarket

• A large set of baskets, each of which is a small set of the items, e.g., the
items one customer buys on one day

4
Market-Baskets – (2)

• In general many-many mapping (association) between two kinds of items


– But we ask about connections among “items,” not “baskets”

• The technology focuses on common events, not rare events

5
EXAMPLE
Market-Basket transactions

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}

Implication means co-occurrence, not causality!
{Diaper} -> {Beer}: In a significant number of transactions where Diaper is present, Beer is also present. But this does not mean that diapers cause people to buy beer.

6
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
  – An implication expression of the form X -> Y, where X and Y are itemsets
  – Example: {Milk, Diaper} -> {Beer}
• Rule Evaluation Metrics for X -> Y
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} -> {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
8
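To make the two metrics concrete, here is a minimal Python sketch (not part of the original slides) that computes them for {Milk, Diaper} -> {Beer} over the five example transactions:

# Support and confidence of X -> Y over the example market-basket transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, transactions):
    # support count: number of transactions containing the itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y, transactions) / len(transactions)        # 2/5 = 0.4
c = sigma(X | Y, transactions) / sigma(X, transactions)   # 2/3 ≈ 0.67
print(f"s = {s:.2f}, c = {c:.2f}")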
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find
all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
 Computationally prohibitive!

9
Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} -> {Beer}   (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk}   (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer}   (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence measures
• Thus, we may decouple the support and confidence requirements
10
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support  minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive

11
Frequent Itemset Generation

[Figure: itemset lattice over items A, B, C, D, E, from the null set through all 1-, 2-, 3- and 4-itemsets up to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
12
Frequent Itemset Generation
• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) => Expensive since M = 2^d !!!

[Figure: N transactions (average width w) matched against a list of M candidate itemsets.]

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
13
Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

      R = 3^d − 2^(d+1) + 1

  – If d = 6, R = 602 rules
14
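As a sanity check (not from the slides), the count of 602 rules for d = 6 can be reproduced by enumerating rules directly:

# Verify R = 3^d - 2^(d+1) + 1 by brute-force enumeration.
from itertools import combinations

def count_rules(d):
    count = 0
    for k in range(1, d + 1):                 # size of the itemset X ∪ Y
        for itemset in combinations(range(d), k):
            count += 2 ** k - 2               # non-empty proper subsets can be the antecedent
    return count

d = 6
print(count_rules(d), 3 ** d - 2 ** (d + 1) + 1)   # both print 602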
Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)


– Complete search: M = 2^d
– Use pruning techniques to reduce M

• Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases

• Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

15
Reducing Number of Candidates
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

• Apriori principle holds due to the following property of the support measure:

    ∀ X, Y : (X ⊆ Y) ⟹ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – Known as the anti-monotone property of support

16
Illustrating Apriori Principle

[Figure: itemset lattice over A–E; an itemset found to be infrequent is crossed out and all of its supersets are pruned from the search space.]
17
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset          | Count
{Bread, Milk}    | 3
{Bread, Beer}    | 2
{Bread, Diaper}  | 3
{Milk, Beer}     | 2
{Milk, Diaper}   | 3
{Beer, Diaper}   | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
18
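A quick arithmetic check of the two candidate counts quoted above (not from the slides):

# Candidate counts with and without support-based pruning.
from math import comb

no_pruning = comb(6, 1) + comb(6, 2) + comb(6, 3)   # 6 + 15 + 20 = 41
with_pruning = 6 + 6 + 1                            # frequent items + surviving pairs + 1 triplet
print(no_pruning, with_pruning)                     # 41 13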
Apriori Algorithm
• Method:

– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that are
frequent

19
THE APRIORI ALGORITHM: BASIC IDEA
• Join Step: Ck is generated by joining Lk-1 with itself^
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemset of size k
    Lk: frequent itemset of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

^ The join of Lk with Lk requires the two joining itemsets to share k-1 items.
20
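The pseudo-code translates fairly directly into Python. Below is a minimal sketch (function and variable names are my own, not from the slides) of the level-wise generate/count/prune loop; running it on the database of the next slide with a minimum support count of 2 reproduces L1, L2 and L3 shown there:

# Minimal Apriori sketch: level-wise candidate generation and support counting.
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Join step: merge frequent k-itemsets that share k-1 items
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    # Prune step: every k-subset of the candidate must be frequent
                    if all(frozenset(sub) in Lk for sub in combinations(union, k)):
                        candidates.add(union)
        # Count step: one scan of the database per level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# Worked example from the next slide (minimum support count = 2):
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for itemset, count in sorted(apriori(D, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)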
THE APRIORI ALGORITHM — EXAMPLE (MinSup = 2)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D -> C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1 (prune {4}): {1}:2, {2}:3, {3}:3, {5}:3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D -> C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (generated from L2): {2 3 5}
Scan D -> L3: {2 3 5}:2
21
Reducing Number of Comparisons
• Candidate counting:
  – Scan the database of transactions to determine the support of each candidate itemset
  – To reduce the number of comparisons, store the candidates in a hash structure
    • Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets

[Figure: N transactions matched against a hash structure whose k buckets hold the candidate itemsets.]
22
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function: e.g. h(p) = p mod 3 (items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third)
• Max leaf size: max number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: hash tree whose leaves partition the 15 candidates according to the hash values of successive items.]
23
Subset Operation
Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

[Figure: the ten size-3 subsets are enumerated level by level — fix the first item (1, 2 or 3), then the second, then the third — giving 123, 125, 126, 135, 136, 156, 235, 236, 256, 356. Each subset is then looked up in the hash tree.]
24
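A minimal sketch of the subset operation (not from the slides): enumerate the size-3 subsets of a transaction with itertools.combinations and look them up in hash buckets. A flat dictionary keyed by the per-position hash values stands in for a real hash tree here, purely to illustrate why hashing cuts down the number of candidate comparisons:

# Enumerate size-3 subsets of a transaction and match them bucket-by-bucket.
from itertools import combinations
from collections import defaultdict

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]

h = lambda p: p % 3                        # hash function from the slide
buckets = defaultdict(list)
for c in candidates:
    buckets[tuple(h(i) for i in c)].append(c)

t = (1, 2, 3, 5, 6)                        # transaction, items in sorted order
for subset in combinations(t, 3):          # the 10 size-3 subsets
    bucket = buckets.get(tuple(h(i) for i in subset), [])
    if subset in bucket:                   # compared only against one small bucket
        print("transaction supports candidate", subset)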
Bottlenecks of Apriori

• Candidate generation can result in huge candidate sets:
  – 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets (how?)
  – To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
• Multiple scans of database:
  – Needs O(n) scans, where n is the length of the longest pattern (?)
25
Factors Affecting Complexity
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may
also increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with
number of transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase max length of frequent itemsets and traversals of hash tree (the number of subsets in a transaction increases with its width)
26
Maximal Frequent Itemset
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.

[Figure: itemset lattice over A–E with a border separating the frequent itemsets from the infrequent ones; the maximal frequent itemsets are the frequent itemsets lying directly on the border.]
27
Closed Itemset
• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D}
4   | {A,B,D}
5   | {A,B,C,D}

Itemset    | Support
{A}        | 4
{B}        | 5
{C}        | 3
{D}        | 4
{A,B}      | 4
{A,C}      | 2
{A,D}      | 3
{B,C}      | 3
{B,D}      | 4
{C,D}      | 3
{A,B,C}    | 2
{A,B,D}    | 3
{A,C,D}    | 2
{B,C,D}    | 3
{A,B,C,D}  | 2
28
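Given a support table like the one above, closed and maximal frequent itemsets can be identified mechanically. A small sketch (mine, not from the slides), using a minimum support of 2:

# Flag closed and maximal frequent itemsets from a support table.
from itertools import combinations

transactions = [{"A","B"}, {"B","C","D"}, {"A","B","C","D"}, {"A","B","D"}, {"A","B","C","D"}]
items = sorted(set().union(*transactions))

# Support of every non-empty itemset (fine for this toy example).
support = {}
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        s = frozenset(itemset)
        support[s] = sum(1 for t in transactions if s <= t)

minsup = 2
frequent = {s: c for s, c in support.items() if c >= minsup}

for s, c in sorted(frequent.items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    supersets = [s | {i} for i in items if i not in s]
    closed  = all(support.get(sup, 0) < c      for sup in supersets)   # no superset with same support
    maximal = all(support.get(sup, 0) < minsup for sup in supersets)   # no frequent superset
    print(sorted(s), c, "closed" if closed else "", "maximal" if maximal else "")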
Maximal vs Closed Itemsets

TID | Items
1   | ABC
2   | ABCD
3   | BCE
4   | ACDE
5   | DE

[Figure: itemset lattice over A–E annotated with the transaction ids supporting each itemset; itemsets not supported by any transaction are marked.]
29
Maximal vs Closed Frequent Itemsets
Minimum support = 2

[Figure: the same lattice with each frequent itemset labelled either "closed but not maximal" or "closed and maximal".]

# Closed = 9
# Maximal = 4
30
Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have identical support as their supersets

[Table: 15 transactions over 30 items A1–A10, B1–B10, C1–C10; transactions 1–5 contain exactly the ten A items, transactions 6–10 exactly the ten B items, and transactions 11–15 exactly the ten C items.]

• Number of frequent itemsets = 3 × Σ_{k=1}^{10} (10 choose k)

• Need a compact representation
31
Maximal vs Closed Itemsets

[Figure comparing maximal and closed itemsets: every maximal frequent itemset is also closed.]
32
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
  – General-to-specific vs Specific-to-general

[Figure: three lattices from the null set down to {a1, a2, ..., an}, showing the frequent itemset border approached (a) general-to-specific, (b) specific-to-general, and (c) bidirectionally.]
33
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
  – Equivalent Classes (two itemsets belong to the same class if they share the same common prefix or suffix)

[Figure: the itemset lattice over A–D arranged as (a) a prefix tree and (b) a suffix tree.]
34
Alternative Methods for Frequent Itemset Generation
• Traversal of Itemset Lattice
– Breadth-first vs Depth-first
– Apriori traverses in BFS manner
– DFS quickly finds maximal frequent set

[Figure: (a) breadth-first vs (b) depth-first traversal of the itemset lattice.]

35
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
• ECLAT: for each item, store a list of transaction ids (tids); vertical data layout

Horizontal Data Layout:
TID | Items
1   | A, B, E
2   | B, C, D
3   | C, E
4   | A, C, D
5   | A, B, C, D
6   | A, E
7   | A, B
8   | A, B, C
9   | A, C, D
10  | B

Vertical Data Layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
36
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.

    A: 1, 4, 5, 6, 7, 8, 9
    B: 1, 2, 5, 7, 8, 10
    A ∧ B -> AB: 1, 5, 7, 8

• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
37
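A minimal sketch of ECLAT-style support counting on the tid-lists above (not from the slides):

# Support counting via tid-list intersection.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

# tid-list of {A, B} = intersection of the tid-lists of {A} and {B}
ab = tidlists["A"] & tidlists["B"]
print(sorted(ab), "support =", len(ab))    # [1, 5, 7, 8] support = 4

# the same idea extends level by level, e.g. {A, B, C}:
abc = ab & tidlists["C"]
print(sorted(abc), "support =", len(abc))  # [8] support = 1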
Transactions Reductions
• Prune irrelevant transactions early
• If a transaction does not contain any frequent k-itemsets, it cannot help in
generating any frequent (k+1)-itemsets.
• This pruning is done after each iteration:
– After processing frequent k-itemsets, remove transactions that contain no
frequent k-itemsets
– Repeat the process for (k+1), (k+2), etc.

38
FP GROWTH ALGORITHM
 Apriori: uses a generate-and-test approach – generates candidate itemsets
and tests if they are frequent
 Generation of candidate itemsets is expensive (in both space and time)
 Support counting is expensive
 Subset checking (computationally expensive)

 Multiple Database scans (I/O)

 FP-Growth: allows frequent itemset discovery without candidate itemset


generation. Two step approach:
 Step 1: Build a compact data structure called the FP-tree
 Built using 2 passes over the data-set.

 Step 2: Extracts frequent itemsets directly from the FP-tree

39
Mining Frequent Patterns Without Candidate Generation

• Compress a large database into a compact, Frequent-Pattern tree (FP-


tree) structure
– Highly condensed, but complete for frequent pattern mining
– Avoid costly database scans
• Develop an efficient, FP-tree-based frequent pattern mining method
– A divide-and-conquer methodology:
• Compress DB into FP-tree, retain itemset associations
• Divide the new DB into a set of conditional DBs – each associated with one
frequent item
• Mine each such database separately
– Avoid candidate generation

40
PHASE 1: FP-TREE CONSTRUCTION

 FP-Tree is constructed using 2 passes over the data-set:


Pass 1:
 Scan data and find support for each item
 Discard infrequent items
 Sort frequent items in decreasing order based on their support
Use this order when building the FP-Tree, so common prefixes can be
shared.

41
STEP 1: FP-TREE CONSTRUCTION

Pass 2:
Nodes correspond to items and have a counter
 FP-Growth reads 1 transaction at a time and maps it to a path
 Fixed order is used, so paths can overlap when transactions share items
(when they have the same prefix).
 In this case, counters are incremented
 Pointers are maintained between nodes containing the same item,
creating singly linked lists (dotted lines)
 The more paths that overlap, the higher the compression. FP-tree may
fit in memory.

42
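The two construction passes described above can be sketched in a few lines of Python (the FPNode class and function names are my own, not from the slides); node-links are kept as simple per-item lists in a header table rather than pointers threaded through the tree:

# Minimal FP-tree construction: two passes over the data set.
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: find the support of each item and discard infrequent items
    support = defaultdict(int)
    for t in transactions:
        for item in set(t):
            support[item] += 1
    frequent = {i: s for i, s in support.items() if s >= min_sup}

    # Pass 2: map each transaction to a path, items in decreasing support order
    root = FPNode(None, None)
    header = defaultdict(list)    # item -> list of nodes containing it (node-links)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1       # shared prefixes just increment counters
    return root, header

# Example from the slides: the nine transactions with min_sup = 2
D = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
     ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
root, header = build_fp_tree(D, 2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# total counts per item: I2:7, I1:6, I3:6, I4:2, I5:2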
FP-Growth Method: An Example
• Consider the example of a database D, consisting of 9 transactions.
• Suppose the min. support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• The first scan of the database is the same as in Apriori, which derives the set of 1-itemsets and their support counts.
• The set of frequent items is sorted in order of descending support count.
• The resulting set is denoted as L = {I2:7, I1:6, I3:6, I4:2, I5:2}.

TID | List of Items
T1  | I1, I2, I5
T2  | I2, I4
T3  | I2, I3
T4  | I1, I2, I4
T5  | I1, I3
T6  | I2, I3
T7  | I1, I3
T8  | I1, I2, I3, I5
T9  | I1, I2, I3
43
Support Count and Sorted Items

Item | Support
I1   | 6
I2   | 7
I3   | 6
I4   | 2
I5   | 2

TID | Items           | Sorted Items
T1  | I1, I2, I5      | I2, I1, I5
T2  | I2, I4          | I2, I4
T3  | I2, I3          | I2, I3
T4  | I1, I2, I4      | I2, I1, I4
T5  | I1, I3          | I1, I3
T6  | I2, I3          | I2, I3
T7  | I1, I3          | I1, I3
T8  | I1, I2, I3, I5  | I2, I1, I3, I5
T9  | I1, I2, I3      | I2, I1, I3
44
FP-Growth Method: Construction of FP-Tree
• First, create the root of the tree, labeled with “null”.
• Scan the database D a second time (First time we scanned it to create 1-itemset
and then L), this will generate the complete tree.
• The items in each transaction are processed in L order (i.e. sorted order).
• A branch is created for each transaction with items having their support count
separated by colon.
• Whenever the same node is encountered in another transaction, we just
increment the support count of the common node or Prefix.
• To facilitate tree traversal, an item header table is built so that each item points to
its occurrences in the tree via a chain of node-links.
• Now, The problem of mining frequent patterns in database is transformed to that
of mining the FP-Tree.
45
FP-Growth Method: Construction of FP-Tree

Item header table (item, support count, node-link): I2:7, I1:6, I3:6, I4:2, I5:2

FP-Tree (root = null):
null
├─ I2:7
│  ├─ I1:4
│  │  ├─ I5:1
│  │  ├─ I4:1
│  │  └─ I3:2
│  │     └─ I5:1
│  ├─ I4:1
│  └─ I3:2
└─ I1:2
   └─ I3:2

An FP-Tree that registers compressed, frequent pattern information
46
Mining the FP-Tree by Creating Conditional (sub) pattern
bases
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base which consists of the set of prefix paths in
the FP-Tree co-occurring with suffix pattern.
3. Then, construct its conditional FP-Tree & perform mining on this tree.
4. The pattern growth is achieved by concatenation of the suffix pattern with the
frequent patterns generated from a conditional FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the required
frequent itemset.

47
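Continuing the FP-tree sketch from earlier, the conditional pattern base of a suffix item is obtained by following its node-links and collecting each prefix path back to the root. A minimal version, assuming the FPNode class and header table from that sketch:

# Conditional pattern base for one suffix item, using the FPNode and header
# table built by the build_fp_tree() sketch shown earlier.
def conditional_pattern_base(item, header):
    base = []                               # list of (prefix_path, count) pairs
    for node in header[item]:               # follow the node-links for this item
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# e.g. conditional_pattern_base("I5", header) on the example FP-tree gives
# [(["I2", "I1"], 1), (["I2", "I1", "I3"], 1)]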
FP-Tree Example Continued

Item | Conditional pattern base        | Conditional FP-Tree    | Frequent patterns generated
I5   | {(I2 I1: 1), (I2 I1 I3: 1)}     | <I2:2, I1:2>           | I2 I5:2, I1 I5:2, I2 I1 I5:2
I4   | {(I2 I1: 1), (I2: 1)}           | <I2:2>                 | I2 I4:2
I3   | {(I2 I1: 2), (I2: 2), (I1: 2)}  | <I2:4, I1:2>, <I1:2>   | I2 I3:4, I1 I3:2, I2 I1 I3:2
I1   | {(I2: 4)}                       | <I2:4>                 | I2 I1:4

Mining the FP-Tree by creating conditional (sub) pattern bases

Now, following the above mentioned steps:
• Let's start from I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
• Therefore, considering I5 as suffix, its 2 corresponding prefix paths are {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.
48
FP-Tree Example Continued
• Out of these, only I1 & I2 are selected in the conditional FP-Tree because I3 does not satisfy the minimum support count.
For I1, support count in conditional pattern base = 1 + 1 = 2
For I2, support count in conditional pattern base = 1 + 1 = 2
For I3, support count in conditional pattern base = 1
Thus support count for I3 is less than required min_sup which is 2 here.
• Now, we have a conditional FP-Tree with us.
• All frequent patterns corresponding to suffix I5 are generated by considering all possible combinations of I5 and the conditional FP-Tree.
• The same procedure is applied to suffixes I4, I3 and I1.
• Note: I2 is not taken into consideration for suffix because it doesn’t have any
prefix at all.

49
Why Frequent Pattern Growth Fast ?
• Performance study shows
– FP-growth is an order of magnitude faster than Apriori
• Reasoning
– No candidate generation, no candidate test
– Use compact data structure
– Eliminate repeated database scans
– Basic operation is counting and FP-tree building

50
FP-TREE SIZE
 The FP-Tree usually has a smaller size than the uncompressed data -
typically many transactions share items (and hence prefixes).
 Best case scenario: all transactions contain the same set of items.
 1 path in the FP-tree
 Worst case scenario: every transaction has a unique set of items (no
items in common)
 Size of the FP-tree is at least as large as the original data.
 Storage requirements for the FP-tree are higher - need to store the pointers
between the nodes and the counters.

 The size of the FP-tree depends on how the items are ordered
 Ordering by decreasing support is typically used but it does not
always lead to the smallest tree (it's a heuristic).

51
ADVANTAGES AND DISADVANTAGES

 Advantages of FP-Growth
 only 2 passes over data-set
 “compresses” data-set
 no candidate generation
 much faster than Apriori
 Disadvantages of FP-Growth
 FP-Tree may not fit in memory!!
 FP-Tree is expensive to build

52
RULE GENERATION
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f -> L – f satisfies the minimum confidence requirement
• If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC -> D,  ABD -> C,  ACD -> B,  BCD -> A,
    A -> BCD,  B -> ACD,  C -> ABD,  D -> ABC,
    AB -> CD,  AC -> BD,  AD -> BC,  BC -> AD,  BD -> AC,  CD -> AB

• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L -> ∅ and ∅ -> L)
53
RULE GENERATION
• How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property:
      c(ABC -> D) can be larger or smaller than c(AB -> D)
  – But confidence of rules generated from the same itemset has an anti-monotone property
      e.g., L = {A,B,C,D}:
      c(ABC -> D) ≥ c(AB -> CD) ≥ c(A -> BCD)
  – Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

If an itemset S satisfies an anti-monotone constraint C, then all of its subsets also satisfy C (i.e., C is downward closed).
54
RULE GENERATION FOR APRIORI ALGORITHM
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:

    join(CD => AB, BD => AC) would produce the candidate rule D => ABC

• Prune rule D => ABC if its subset AD => BC does not have high confidence (i.e. confidence below threshold)
55
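A minimal sketch of confidence-based rule generation from a single frequent itemset (helper names are mine, not from the slides); consequents are grown level by level, and only consequents whose rules meet minconf are extended, which is where the anti-monotone property from the previous slide is exploited:

# Generate high-confidence rules from one frequent itemset.
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(itemset, transactions, minconf):
    itemset = frozenset(itemset)
    sup_L = support_count(itemset, transactions)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # start with 1-item consequents
    while consequents:
        next_level = set()
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = sup_L / support_count(X, transactions)
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
                # only rules that pass are used to build larger consequents
                for i in X:
                    next_level.add(Y | {i})
        consequents = list(next_level)
    return rules

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for X, Y, conf in generate_rules({"Milk", "Diaper", "Beer"}, transactions, minconf=0.6):
    print(X, "->", Y, f"(c={conf:.2f})")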
COMPUTING INTERESTINGNESS MEASURE
• Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X -> Y:
       |  Y  | ¬Y  |
   X   | f11 | f10 | f1+
  ¬X   | f01 | f00 | f0+
       | f+1 | f+0 | |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

• Used to define various measures: support, confidence, lift, Gini, J-measure, etc.
56
DRAWBACK OF CONFIDENCE

        | Coffee | ¬Coffee |
   Tea  |   15   |    5    |  20
  ¬Tea  |   75   |    5    |  80
        |   90   |   10    | 100

Association Rule: Tea -> Coffee

Confidence(Tea -> Coffee) = 0.75 and Support(Tea -> Coffee) = 0.15
but Support(Coffee) = 0.9

• Although confidence is high, the rule is misleading: the fraction of tea drinkers who drink coffee (0.75) is actually less than the overall fraction of people who drink coffee (0.9).
57
STATISTICAL-BASED MEASURES
• Measures that take into account statistical dependence

For a rule X -> Y:

    Lift = confidence(X -> Y) / support(Y)

• Lift < 1: negatively associated
• Lift = 1: no dependence
• Lift > 1: positively associated
58
EXAMPLE: LIFT/INTEREST

        | Coffee | ¬Coffee |
   Tea  |   15   |    5    |  20
  ¬Tea  |   75   |    5    |  80
        |   90   |   10    | 100

Association Rule: Tea -> Coffee

Confidence(Tea -> Coffee) = 0.75
but Support(Coffee) = 0.9

Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore negatively associated)
59
