Association Rule Mining Spring 2022
Market-Basket transactions
Example of Association Rules
{Diaper} → {Cereal}
{Milk, Bread} → {Eggs, Coke}
{Cereal, Bread} → {Milk}
Definition: Frequent Itemset
🞂 Itemset
🞂 A collection of one or more items
🞂 E.g. {Milk, Bread, Diaper}
🞂 k-itemset
🞂 An itemset that contains k items
🞂 Support count (σ)
🞂 Frequency of occurrence of an itemset
🞂 E.g. σ({Milk, Bread, Diaper}) = 2
🞂 Support
🞂 Fraction of transactions that contain an itemset
🞂 E.g. s({Milk, Bread, Diaper}) = 2/5
🞂 Frequent Itemset
🞂 An itemset whose support is greater than or equal to a min support threshold
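As a quick sketch of these definitions in code, the snippet below computes the support count σ(X) and the support s(X) over a small hypothetical transaction list; the five baskets are assumptions chosen so that σ({Milk, Bread, Diaper}) = 2 and s = 2/5, matching the example above.

```python
# Hypothetical market-basket transactions (assumed for illustration).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Cereal", "Eggs"},
    {"Milk", "Diaper", "Cereal", "Coke"},
    {"Bread", "Milk", "Diaper", "Cereal"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, db) / len(db)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions), support(X, transactions))  # 2 and 0.4, i.e. 2/5
```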
Definitions
• Association Rule
– An implication X → Y, where X and Y are itemsets
• For the rule A ⇒ C:
– support = support({A} ∪ {C}) = 50%
– confidence = support({A} ∪ {C}) / support({A}) = 66.6%
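A minimal sketch of these rule-level measures in code; the four-transaction database below is an assumption chosen so that the rule A ⇒ C comes out at support 50% and confidence 66.6%, as in the example above (it is not the slide's original table).

```python
def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def rule_measures(lhs, rhs, db):
    """Support and confidence of the rule lhs -> rhs."""
    s_rule = support(lhs | rhs, db)           # support(X -> Y) = support(X ∪ Y)
    return s_rule, s_rule / support(lhs, db)  # confidence = support(X ∪ Y) / support(X)

# Hypothetical database in which A ⇒ C has support 50% and confidence 66.6%.
db = [{"A", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B", "C"}]
s, c = rule_measures({"A"}, {"C"}, db)
print(round(s, 2), round(c, 3))  # 0.5 0.667
```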
Apriori Algorithm
🞂 Apriori property: any subset of a frequent itemset must be frequent
🞂 if {cereal, diaper, nuts} is frequent, so is {cereal, diaper}
🞂 Every transaction having {cereal, diaper, nuts} also contains {cereal, diaper}
[Figure: itemset lattice over items A, B, C, D, with 1-itemsets A, B, C, D and 2-itemsets AB, AC, AD, BC, BD, CD.]
Apriori Algorithm
🞂 Method:
🞂 generate length (k+1) candidate itemsets from length-k frequent itemsets, and
🞂 test the candidates against the DB
🞂 Performance studies show its efficiency and scalability
Illustration of the Apriori principle
[Figure: itemset lattice in which an itemset found to be infrequent has all of its supersets marked infrequent and pruned.]
The Apriori algorithm
Ck = candidate itemsets of size k
Lk = frequent itemsets of size k
Level-wise approach:
1. k = 1, C1 = all items
2. While Ck is not empty:
3.   Frequent itemset generation: scan the database to find which itemsets in Ck are frequent and put them into Lk
4.   Candidate generation: use Lk to generate a collection of candidate itemsets Ck+1 of size k+1
5.   k = k + 1
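A compact, illustrative Python implementation of this level-wise loop; function and variable names are my own, so treat it as a sketch rather than the course's reference code. It is run on the small four-transaction database used in the worked example that follows.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori: returns {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    Ck = [frozenset([i]) for i in items]      # k = 1: C1 = all items
    frequent, k = {}, 1
    while Ck:
        # Frequent itemset generation: scan the database and count every candidate.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(Lk)
        # Candidate generation: join Lk with itself, prune by the Apriori property.
        Ck = set()
        for a, b in combinations(Lk, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in Lk for sub in combinations(cand, k)
            ):
                Ck.add(cand)
        Ck = list(Ck)
        k += 1
    return frequent

# Toy database from the worked example below (Tids 10, 20, 30, 40).
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, minsup_count=2))
```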
The Apriori Algorithm—An Example
Database TDB (min support count = 2):
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

C1 after the 1st scan (candidate 1-itemsets with support counts):
  {A}: 2   {B}: 3   {C}: 3   {D}: 1   {E}: 3
L1 (frequent 1-itemsets):
  {A}: 2   {B}: 3   {C}: 3   {E}: 3

C2 (candidates generated from L1):
  {A, B}   {A, C}   {A, E}   {B, C}   {B, E}   {C, E}
C2 after the 2nd scan (with support counts):
  {A, B}: 1   {A, C}: 2   {A, E}: 1   {B, C}: 2   {B, E}: 3   {C, E}: 2
L2 (frequent 2-itemsets):
  {A, C}: 2   {B, C}: 2   {B, E}: 3   {C, E}: 2

C3 (candidates generated from L2):
  {B, C, E}
L3 after the 3rd scan (frequent 3-itemsets):
  {B, C, E}: 2
The Apriori Algorithm—An Example
Database TDB (Tid: Items): 10: A, C, D; 20: B, C, E; 30: A, B, C, E; 40: B, E
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
L3: {B, C, E}: 2
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a partitioning of the frequent itemset into a Left-Hand Side (LHS) and a Right-Hand Side (RHS)
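A sketch of this rule-generation step in Python, assuming the frequent itemsets and their support counts are already available; the helper name and the confidence threshold are illustrative.

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """freq: {frozenset: support count}. Yields (lhs, rhs, confidence) for confident rules."""
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                conf = count / freq[lhs]   # conf(X -> Y) = sigma(X ∪ Y) / sigma(X)
                if conf >= min_conf:
                    yield lhs, rhs, conf

# Frequent itemsets (with support counts) from the Apriori example above.
freq = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
for lhs, rhs, conf in generate_rules(freq, min_conf=0.8):
    print(set(lhs), "->", set(rhs), round(conf, 2))
```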
The DIC (Dynamic Itemset Counting) algorithm
1. The empty itemset is marked with a solid box. All the 1-itemsets are marked with dashed circles. All other itemsets are unmarked.
2. Read M transactions. For each transaction, increment the respective counters for the itemsets marked with dashes.
3. If a dashed circle has a count that exceeds the support threshold, turn it into a dashed square. If any immediate superset of it now has all of its subsets marked as solid or dashed squares, add a new counter for that superset and mark it with a dashed circle.
4. If a dashed itemset has been counted through all the transactions, make it solid and stop counting it.
5. If we are at the end of the transaction file, rewind to the beginning.
6. If any dashed itemsets remain, go to step 2.
Fig 3. Start of DIC algorithm
Fig 4. After M transactions
Fig 5. After 2M transactions
Fig 6. After one pass
DIC- Data structure
🞂 like the hash tree used in Apriori, with a little extra information
🞂 Every node stores
🞂 the last item in the itemset
🞂 counter, marker, its state
🞂 its branches if it is an interior node
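A minimal sketch of such a node in Python; the field names follow the bullets above, but the exact layout in the original DIC hash-tree structure may differ.

```python
from dataclasses import dataclass, field

@dataclass
class DICNode:
    """One node of a DIC counting trie (hash-tree-like structure); fields follow the slide."""
    last_item: str                       # last item of the itemset this node represents
    counter: int = 0                     # transactions counted so far that contain the itemset
    marker: int = 0                      # transactions seen since counting started (interpretation assumed)
    state: str = "dashed_circle"         # dashed/solid circle/square, as in the DIC steps above
    branches: dict = field(default_factory=dict)  # children keyed by item (interior nodes only)
```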
DIC- Implication rules
🞂 conviction
🞂 more useful and intuitive measure
🞂 unlike confidence,
🞂 normalized based on both the antecedent and the consequent
🞂 unlike interest,
🞂 directional
🞂 actual implication as opposed to co-occurrence
🞂 support : P(A, B)
🞂 confidence : P(B|A) = P(A, B)/P(A)
🞂 interest : P(A, B) / (P(A)P(B))
🞂 conviction : P(A)P(¬B)/P(A, ¬B)
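As an illustration, all four measures can be computed directly from transaction data. The helper and its toy baskets below are assumptions for this sketch, not part of the DIC paper.

```python
def measures(A, B, db):
    """support, confidence, interest (lift) and conviction of the rule A -> B."""
    n = len(db)
    p_a  = sum(1 for t in db if A <= t) / n
    p_b  = sum(1 for t in db if B <= t) / n
    p_ab = sum(1 for t in db if A | B <= t) / n
    p_a_not_b = p_a - p_ab                      # P(A, ¬B)
    conviction = float("inf") if p_a_not_b == 0 else p_a * (1 - p_b) / p_a_not_b
    return p_ab, p_ab / p_a, p_ab / (p_a * p_b), conviction

# Hypothetical baskets; conviction > 1 means the rule is violated less often
# than it would be if A and B were independent.
db = [{"cereal", "diaper"}, {"cereal", "diaper"}, {"cereal"}, {"diaper"}, {"eggs"}]
print(measures({"cereal"}, {"diaper"}, db))  # support 0.4, confidence ~0.67, interest ~1.11, conviction 1.2
```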
Reducing Number of Candidates
🞂 Apriori principle:
🞂 If an itemset is frequent, then all of its subsets must also be
frequent
🞂 M. Zaki et al. New algorithms for fast discovery of association rules. in KDD’97
The Partitioning Algorithm
Frequent Pattern Growth (FP)
🞂 First algorithm that allows frequent pattern mining without generating candidate sets
🞂 Grow long patterns from short ones using local frequent items
🞂 "ab" is a frequent pattern
🞂 Get all transactions having "ab": DB|ab
🞂 "d" is a local frequent item in DB|ab → "abd" is a frequent pattern
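A tiny sketch of this pattern-growth idea: project the database on a frequent pattern and collect the locally frequent items. The database and threshold below are hypothetical.

```python
def projected_db(db, pattern):
    """DB|pattern: the transactions that contain every item of pattern."""
    return [t for t in db if pattern <= t]

def local_frequent_items(db, pattern, minsup_count):
    """Items (outside pattern) that are frequent within DB|pattern."""
    counts = {}
    for t in projected_db(db, pattern):
        for item in t - pattern:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, n in counts.items() if n >= minsup_count}

db = [{"a", "b", "d"}, {"a", "b", "c", "d"}, {"a", "b"}, {"b", "c"}]
# "ab" is frequent; "d" is locally frequent in DB|ab, so "abd" is frequent too.
print(local_frequent_items(db, {"a", "b"}, minsup_count=2))  # {'d'}
```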
Construct FP-tree from a Transaction Database
FP-Growth
FP-Tree size
⮚ The FP-tree usually has a smaller size than the uncompressed data
⮚ Typically many transactions share items (and hence prefixes)
⮚ Best-case scenario: all transactions contain the same set of items
⮚ 1 path in the FP-tree
⮚ Worst-case scenario: every transaction has a unique set of items (no items in common)
⮚ Size of the FP-tree is at least as large as the original data
⮚ Storage requirements for the FP-tree are higher: it must also store the pointers between the nodes and the counters
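A compact sketch of FP-tree construction under these assumptions; the node layout and helper names are illustrative, and real FP-growth implementations keep fuller node-link/header-table machinery than what is hinted at here.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}                     # item -> FPNode

def build_fp_tree(db, minsup_count):
    """Insert each transaction, items reordered by global frequency, into a shared prefix tree."""
    freq = Counter(item for t in db for item in t)
    freq = {i: n for i, n in freq.items() if n >= minsup_count}
    root, header = FPNode(None), {}            # header: item -> list of nodes (node links)
    for t in db:
        # Keep only globally frequent items, sorted by descending support so prefixes are shared.
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Hypothetical transactions; each item's total count can be read back off its node links.
db = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"}, {"a", "d", "e"}, {"a", "b", "c"}]
root, header = build_fp_tree(db, minsup_count=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```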
Example 2: FP-Tree Construction
Example 2: Conditional Pattern Base
Example 2
Let minSup = 2
Extract all frequent itemsets containing e.
Closed Patterns and Max-Patterns
🞂 Example:
🞂 {a1, …, a10} contains C(10,1) + C(10,2) + … + C(10,10) = 2^10 − 1 sub-patterns!
🞂 {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
(C(n, k) denotes the binomial coefficient "n choose k".)
Maximal Frequent Itemset
🞂 An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent
Closed Itemset
🞂 An itemset is closed if none of its immediate supersets has
the same support as the itemset
Maximal vs Closed Itemsets
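A small sketch that classifies known frequent itemsets as closed and/or maximal from their support counts. The helper is hypothetical; checking only immediate supersets is enough because, by the Apriori property, any larger frequent superset implies a frequent immediate superset.

```python
def classify(freq):
    """freq: {frozenset: support count} of all frequent itemsets.
    Returns {itemset: (is_closed, is_maximal)}."""
    result = {}
    for itemset, sup in freq.items():
        supersets = [s for s in freq if len(s) == len(itemset) + 1 and itemset < s]
        is_closed  = all(freq[s] != sup for s in supersets)   # no immediate superset with equal support
        is_maximal = not supersets                            # no frequent immediate superset at all
        result[itemset] = (is_closed, is_maximal)
    return result

# Frequent itemsets (support counts) from the Apriori example.
freq = {
    frozenset("A"): 2, frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("AC"): 2, frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
for itemset, (closed, maximal) in classify(freq).items():
    print(sorted(itemset), "closed" if closed else "", "maximal" if maximal else "")
```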
MaxMiner: Mining Max-Patterns
🞂 Max-Miner algorithm
🞂 Efficiently extracts only the maximal frequent itemsets
🞂 Roughly linear in the number of maximal frequent itemsets
🞂 Uses "look ahead", not a pure bottom-up search
🞂 By identifying a long frequent itemset early on, it can prune all of that itemset's subsets from consideration
Complete set-enumeration tree over four items
[Figure: set-enumeration tree rooted at {}, with children 1, 2, 3, 4, expanding through the 2- and 3-itemsets (e.g. 1,2,3  1,3,4  2,3,4) down to 1,2,3,4.]
🞂 Candidate group g
🞂 head, h(g)
🞂 represents the itemset enumerated by the node
🞂 tail, t(g)
🞂 an ordered set that contains all items not in h(g) that can potentially appear in any sub-node
🞂 e.g. for the node enumerating itemset {1}: h(g) = {1}, t(g) = {2, 3, 4}
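A small sketch of how head/tail candidate groups enumerate this tree; the recursive helper below is illustrative and does not include Max-Miner's look-ahead pruning.

```python
def expand(head, tail):
    """Recursively enumerate the set-enumeration tree of candidate groups (head, tail)."""
    print(head, "tail:", tail)
    for i, item in enumerate(tail):
        # Child node: move one tail item into the head; the remaining tail items stay ordered.
        expand(head + [item], tail[i + 1:])

expand([], [1, 2, 3, 4])   # visits all 16 subsets of {1, 2, 3, 4} exactly once
# e.g. the node with head [1] has tail [2, 3, 4], matching h(g) = {1}, t(g) = {2, 3, 4} above
```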
Max-Miner
[Figure: the set-enumeration tree over items 1, 2, 3, 4 as explored by Max-Miner.]
CHARM: Mining by Exploring Vertical Data Format
🞂 Closed
🞂 CHARM, CLOSET+, COFI-CLOSED, Leap
🞂 Maximal
🞂 MaxMiner, MAFIA, GENMAX, COFI-MAX, Leap
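As a brief illustration of the vertical (tidset) data format that CHARM-style algorithms exploit: the support of an itemset is the size of the intersection of its items' tidsets. The toy database below reuses the Apriori example, and the helper names are assumptions.

```python
# Horizontal database: Tid -> items (the toy database from the Apriori example).
horizontal = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

# Vertical format: item -> tidset.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

def tidset(itemset):
    """Tidset of an itemset = intersection of its items' tidsets; its size is the support count."""
    result = None
    for item in itemset:
        result = vertical[item] if result is None else result & vertical[item]
    return result

print(tidset({"B", "E"}))            # {20, 30, 40} -> support count 3
print(len(tidset({"B", "C", "E"})))  # 2
```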
Performance Evaluation of Algorithms
🞂 The FP-growth method was usually better than the best implementation of the Apriori algorithm
Questions?
Feel free to ask anything! I'm more than happy to provide the answers.