DM Lect7
Association Rules
Lec 7
Mohammed
Taiz University
Outlines
• Basic Concepts
• Frequent Itemset Mining Methods
What Is Frequent Pattern Analysis?
• Frequent pattern
• a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and
association rule mining
• Finding frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.
• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
stream) analysis, and DNA sequence analysis
Market Basket Analysis
Basic Concepts: Transactional Data
• Market basket example:
• Basket 1:{}
• Basket 2:{}
• Basket 3:{}
• ….
• Basket n:{}
• Definitions:
• An item: an article in a basket, or an attribute-value pair.
• A transaction: items purchased in a basket.
• A transactional dataset: A set of transactions.
Basic Concepts: Frequent Patterns
• Itemset
• A set of one or more items
• e.g., {A, B, D} is an itemset.
• k-itemset X = {x1, …, xk} is an itemset with k items.
• (absolute) support, or support count, of X
• Frequency (number of occurrences) of an itemset X
• e.g., sup(A) = 3

Transaction-id  Items bought
10              A, B, D
20              A, C, D
30              A, D, E
40              B, E, F
50              B, C, D, E, F

• Freq. Pat. (min support = 3): {A:3, B:3, D:4, E:3, AD:3}
• Association rules:
• A ⇒ D (60%, 100%)
• 60% of all transactions show that A and D are purchased together
• 100% of the customers who purchased A also bought D
• D ⇒ A (60%, 75%)
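The support counts above can be reproduced with a short Python sketch (the helper name `support_count` is illustrative, not from the slides):

```python
# The five baskets from the table above, keyed by transaction id
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support_count(itemset, transactions):
    # Absolute support: how many transactions contain the itemset
    return sum(1 for t in transactions.values() if itemset <= t)

print(support_count({"A"}, transactions))       # 3
print(support_count({"A", "D"}, transactions))  # 3
print(support_count({"D"}, transactions))       # 4
```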
Basic Concepts: Association Rules
• An association rule X ⇒ Y describes a relationship between two
disjoint itemsets X and Y.
• It presents the pattern that when X occurs, Y also occurs.
• A rule is measured by its support, sup(X ∪ Y)/N, and its
confidence, sup(X ∪ Y)/sup(X); it is strong if both exceed
user-set thresholds.
• A strong rule is not necessarily interesting (a proper choice of
thresholds is necessary).
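The support and confidence figures of the earlier rules A ⇒ D (60%, 100%) and D ⇒ A (60%, 75%) can be checked with a sketch (function names are illustrative):

```python
# Transactions from the market-basket example
transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]

def sup(X):
    return sum(1 for t in transactions if X <= t)

def rule_metrics(X, Y):
    # support(X => Y) = sup(X u Y) / N; confidence = sup(X u Y) / sup(X)
    both = sup(X | Y)
    return both / len(transactions), both / sup(X)

print(rule_metrics({"A"}, {"D"}))  # (0.6, 1.0): A => D (60%, 100%)
print(rule_metrics({"D"}, {"A"}))  # (0.6, 0.75): D => A (60%, 75%)
```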
Association rule mining
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent
itemset
• Problem parameters:
• N = |T|: number of transactions
• d = |I|: number of (distinct) items
• w: max width of a transaction
• Number of possible itemsets? 2^d, exponential in the number of items
• Solution
• Mine closed patterns and max-patterns instead
Closed Frequent Itemset
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X
with the same support as X.
• An itemset is closed if none of its immediate supersets has the same support
as the itemset.
• Closed pattern is a lossless compression of frequent patterns.
• Reducing the number of patterns and rules.
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset   Support      Itemset        Support
{A}       4            {A, B, C}      2
{B}       5            {A, B, D}      3
{C}       3            {A, C, D}      2
{D}       4            {B, C, D}      3
{A, B}    4            {A, B, C, D}   2
{A, C}    2
{A, D}    3
{B, C}    3
{B, D}    4
{C, D}    3
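A brute-force closedness check over the transactions in the table can be sketched as follows (helper names are illustrative; minsup = 2 is assumed):

```python
from itertools import combinations

# Transactions from the TID table above
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"A", "B", "C", "D"},
]
items = sorted(set().union(*transactions))

def sup(X):
    return sum(1 for t in transactions if X <= t)

def is_closed(X):
    # Closed: no immediate superset has the same support
    return all(sup(X | {i}) < sup(X) for i in items if i not in X)

minsup = 2
closed = [frozenset(c) for k in range(1, len(items) + 1)
          for c in combinations(items, k)
          if sup(set(c)) >= minsup and is_closed(set(c))]
# 6 closed itemsets: {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, {A,B,C,D}
print(closed)
```

For example, {A} (support 4) is not closed because its superset {A, B} also has support 4, so {A} can be recovered losslessly from the closed pattern {A, B}.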
Maximal Frequent Itemset
• An itemset X is a max-pattern if X is frequent and there exists no frequent
super-pattern Y ⊃ X.
• An itemset is maximal frequent if none of its immediate supersets is frequent.
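A maximality check on the same five transactions can be sketched; minsup = 3 is assumed here for illustration (with minsup = 2 every itemset in the table is frequent, so only {A, B, C, D} would be maximal):

```python
from itertools import combinations

# Transactions from the TID table above
transactions = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
    {"A", "B", "D"}, {"A", "B", "C", "D"},
]
items = sorted(set().union(*transactions))

def sup(X):
    return sum(1 for t in transactions if X <= t)

def is_maximal(X, minsup):
    # Maximal frequent: frequent, and no immediate superset is frequent
    return sup(X) >= minsup and all(
        sup(X | {i}) < minsup for i in items if i not in X)

maximal = [frozenset(c) for k in range(1, len(items) + 1)
           for c in combinations(items, k) if is_maximal(set(c), 3)]
print(maximal)  # {A, B, D} and {B, C, D}
```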
Maximal vs Closed Itemsets
Frequent Itemset Mining Methods
• Scalable mining methods: Three major approaches
• Apriori (Agrawal & Srikant@VLDB’94)
• Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
• Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
Apriori: A Candidate Generation-and-Test
Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
• Method:
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Test the candidates against DB
• Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2
L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
end
return ∪k Lk;
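The pseudocode above can be turned into a runnable sketch (names and structure are illustrative, not the original slide code):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise Apriori; returns {frozenset(itemset): support_count}."""
    transactions = [frozenset(t) for t in transactions]

    def sup(X):
        return sum(1 for t in transactions if X <= t)

    items = frozenset().union(*transactions)
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    result = {X: sup(X) for X in Lk}
    k = 1
    while Lk:
        # Self-join: union pairs of frequent k-itemsets into (k+1)-candidates
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in cands if sup(c) >= minsup}
        result.update({X: sup(X) for X in Lk})
        k += 1
    return result

# The TDB example above, minsup = 2
freq = apriori([{"A", "C", "D"}, {"B", "C", "E"},
                {"A", "B", "C", "E"}, {"B", "E"}], 2)
print(len(freq))  # 9 frequent itemsets: L1 (4) + L2 (4) + L3 ({B, C, E})
```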
Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• How to count supports of candidates?
• Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
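The self-join and prune steps on this L3 can be sketched as (variable names are illustrative):

```python
from itertools import combinations

# L3 as sorted tuples of items
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
k = 3

# Self-join: merge k-itemsets that agree on their first k-1 items
joined = [p[:k - 1] + (p[-1], q[-1]) for p in L3 for q in L3
          if p[:k - 1] == q[:k - 1] and p[-1] < q[-1]]

# Prune: drop any candidate with an infrequent k-subset
L3set = set(L3)
C4 = [c for c in joined if all(s in L3set for s in combinations(c, k))]

print(joined)  # abcd (from abc, abd) and acde (from acd, ace)
print(C4)      # only abcd survives: ade is not in L3, so acde is pruned
```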
Improving the Efficiency of Apriori
• Bottlenecks of the Apriori approach:
• Candidate generation and test:
• Often generate a huge number of candidates.
• It is costly to repeatedly scan the whole database.
• Is it interesting?
•
= 0.89
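The 0.89 above matches the lift of the classic play-basketball ⇒ eat-cereal example from Han et al.; assuming those textbook numbers (5000 students, 3000 play basketball, 3750 eat cereal, 2000 do both), the computation is:

```python
# Assumed textbook numbers (Han et al.): 5000 students, 3000 play
# basketball, 3750 eat cereal, 2000 do both.
n, n_b, n_c, n_both = 5000, 3000, 3750, 2000

support = n_both / n          # 0.40
confidence = n_both / n_b     # ~0.667: the rule looks "strong"
lift = (n_both / n) / ((n_b / n) * (n_c / n))
print(round(lift, 2))  # 0.89: lift < 1, the items are negatively
                       # correlated, so the strong rule is misleading
```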
ANY QUESTIONS