Mining Association Rules in Large Databases
Overview
Basic Concepts of Association Rule Mining
The Apriori Algorithm (Mining single-dimensional Boolean association rules)
Methods to Improve Apriori's Efficiency
Frequent-Pattern Growth (FP-Growth) Method
From Association Analysis to Correlation Analysis
Summary
Simple Formulas:
Confidence (A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A) = P(B|A) = P(A ∪ B) / P(A)
Support (A ⇒ B) = (# tuples containing both A and B) / (total number of tuples) = P(A ∪ B)
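To make the two formulas concrete, here is a small, self-contained Python sketch (the toy transactions and the helper name support_count are illustrative, not part of the original slides) that computes support and confidence for a rule A ⇒ B by counting:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]

def support_count(itemset, transactions):
    """Number of tuples (transactions) containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

A, B = {"bread"}, {"milk"}
support = support_count(A | B, transactions) / len(transactions)                  # P(A U B)
confidence = support_count(A | B, transactions) / support_count(A, transactions)  # P(B|A)
print(f"support(A => B)    = {support:.2f}")     # 2/4 = 0.50
print(f"confidence(A => B) = {confidence:.2f}")  # 2/3 = 0.67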
The Apriori Algorithm (Mining single-dimensional Boolean association rules)
Pseudo-code:
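Below is a compact Python sketch of the level-wise procedure (an illustrative rendering; function and variable names such as apriori and min_sup_count are mine): L1 is found by scanning D, and each subsequent Ck is built by joining Lk-1 with itself, pruned when a (k-1)-subset is infrequent, counted against D, and filtered by the minimum support count.

from itertools import combinations

def apriori(transactions, min_sup_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    A compact sketch of the level-wise Apriori procedure; not optimized.
    """
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup_count}
    frequent = dict(L)

    k = 2
    while L:
        # Join step: merge frequent (k-1)-itemsets whose union has k items.
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k:
                    # Prune step: every (k-1)-subset must already be frequent.
                    if all(frozenset(sub) in L
                           for sub in combinations(union, k - 1)):
                        candidates.add(union)
        # Scan D once to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            tset = set(t)
            for c in candidates:
                if c <= tset:
                    counts[c] += 1
        L = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent.update(L)
        k += 1
    return frequent

Applied to the nine-transaction database of the worked example that follows, with a minimum support count of 2, this sketch reproduces the itemsets L1, L2, and L3 derived below.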
TID     List of Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Consider a database, D, consisting of 9 transactions.
Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
Let the minimum confidence required be 70%.
We first have to find the frequent itemsets using the Apriori algorithm.
Then, association rules will be generated using min. support & min. confidence.
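For following along in code, the database D and the two thresholds can be transcribed directly (the variable names are mine):

# The example database D (TIDs T100-T900), encoded for experimentation.
D = [
    {"I1", "I2", "I5"},        # T100
    {"I2", "I4"},              # T200
    {"I2", "I3"},              # T300
    {"I1", "I2", "I4"},        # T400
    {"I1", "I3"},              # T500
    {"I2", "I3"},              # T600
    {"I1", "I3"},              # T700
    {"I1", "I2", "I3", "I5"},  # T800
    {"I1", "I2", "I3"},        # T900
]

min_sup_count = 2          # 2 out of 9 transactions, roughly 22%
min_confidence = 0.70      # 70%

# Quick sanity check of one count used later: sc{I1, I2, I5} = 2 (T100, T800).
print(sum(1 for t in D if {"I1", "I2", "I5"} <= t))  # -> 2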
C1, with the support count of each candidate obtained by scanning D. Comparing each candidate's support count with the minimum support count, every candidate qualifies, so L1 = C1:

Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2
C2 (generated from L1), with the support count of each candidate obtained by scanning D:

Itemset     Sup. Count
{I1, I2}    4
{I1, I3}    4
{I1, I4}    1
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
{I3, I4}    0
{I3, I5}    1
{I4, I5}    0

Comparing each candidate's support count with the minimum support count gives L2:

Itemset     Sup. Count
{I1, I2}    4
{I1, I3}    4
{I1, I5}    2
{I2, I3}    4
{I2, I4}    2
{I2, I5}    2
C3 = L2 join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
Now the join step is complete, and the prune step is used to reduce the size of C3. The prune step helps avoid heavy computation due to a large Ck: a candidate can be frequent only if every one of its (k-1)-item subsets is frequent, so {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5} are pruned (each contains a 2-item subset that is not in L2), leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
Scanning D for the count of each remaining candidate and comparing it with the minimum support count, both candidates qualify, so L3 = C3:

Itemset         Sup. Count
{I1, I2, I3}    2
{I1, I2, I5}    2
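These two steps can be sketched directly in Python. The function below (apriori_gen is the name conventionally used for this procedure; the implementation details are my own) performs the join by unioning pairs of frequent (k-1)-itemsets and the prune by checking that every (k-1)-subset of a candidate is frequent; applied to L2 from the example it yields exactly the two surviving candidates.

from itertools import combinations

def apriori_gen(L_prev, k):
    """Join step + prune step: build Ck from the frequent (k-1)-itemsets L_prev."""
    L_prev = {frozenset(s) for s in L_prev}
    Ck = set()
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == k:                     # join: the pair differs in one item
                if all(frozenset(sub) in L_prev     # prune: all (k-1)-subsets frequent
                       for sub in combinations(union, k - 1)):
                    Ck.add(union)
    return Ck

# Reproducing the step above with L2 from the worked example.
L2 = [{"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
      {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"}]
print(sorted(sorted(c) for c in apriori_gen(L2, 3)))
# -> [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]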
Back to the example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
For every frequent itemset l and every nonempty proper subset s of l, the rule s ⇒ (l − s) is output if its confidence, sc(l)/sc(s), is at least the minimum confidence (70%).
Let's take l = {I1,I2,I5}.
Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
The resulting candidate rules and their confidences are:
R1: I1 ^ I2 ⇒ I5
Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ^ I5 ⇒ I2
Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ^ I5 ⇒ I1
Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: I1 ⇒ I2 ^ I5
Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 ⇒ I1 ^ I5
Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 ⇒ I1 ^ I2
Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.
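The rule-generation step can also be sketched in code. The function below (generate_rules and the hard-coded support counts are illustrative, taken from the worked example) enumerates every nonempty proper subset s of a frequent itemset l and keeps s ⇒ (l − s) when sc(l)/sc(s) meets the minimum confidence:

from itertools import combinations

# Support counts taken from the worked example (min. confidence = 70%).
support_count = {
    frozenset(s): c for s, c in [
        (("I1",), 6), (("I2",), 7), (("I5",), 2),
        (("I1", "I2"), 4), (("I1", "I5"), 2), (("I2", "I5"), 2),
        (("I1", "I2", "I5"), 2),
    ]
}

def generate_rules(l, support_count, min_conf):
    """Yield (antecedent, consequent, confidence) for each strong rule from itemset l."""
    l = frozenset(l)
    for r in range(1, len(l)):                      # sizes of the antecedent s
        for s in map(frozenset, combinations(l, r)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                yield sorted(s), sorted(l - s), conf

for ante, cons, conf in generate_rules({"I1", "I2", "I5"}, support_count, 0.70):
    print(f"{ante} => {cons}  (confidence = {conf:.0%})")
# Prints the three strong rules R2, R3, and R6; line order may vary.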
Methods to Improve Apriori's Efficiency
Frequent-Pattern Growth (FP-Growth) Method
FP-Growth Method: An Example
TID     List of Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
[Figure: the FP-tree built from D. The header table lists the frequent items in descending support-count order (I2, I1, I3, I4, I5), each with its support count and a node-link into the tree. The tree has the branches I2:7 → I1:4 → I5:1, I2:7 → I1:4 → I3:2 → I5:1, I2:7 → I1:4 → I4:1, I2:7 → I3:2, I2:7 → I4:1, and I1:2 → I3:2 under the null root.]
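The figure above can be reproduced with a small amount of code. The sketch below (class and function names such as FPNode and build_fp_tree are mine) orders each transaction's frequent items by descending support count, inserts the resulting paths into a prefix tree with shared counts, and keeps a header table of node-links per item:

from collections import Counter, defaultdict

class FPNode:
    """One node of the FP-tree: an item, a count, and links to parent/children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_sup_count):
    """Return (root, header) where header maps each frequent item to its node-list."""
    support = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in support.items() if c >= min_sup_count}

    root = FPNode(None)
    header = defaultdict(list)      # item -> node-links (all nodes carrying the item)
    for t in transactions:
        # Keep only frequent items, ordered by descending support count.
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-support[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Build the tree for the example database D with min_sup_count = 2, then print
# each item's nodes, e.g. I3 -> [('I3', 2), ('I3', 2), ('I3', 2)].
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
root, header = build_fp_tree(D, 2)
for item, nodes in sorted(header.items()):
    print(item, [(n.item, n.count) for n in nodes])

Running this on the example database reproduces the branches in the figure: I2:7 with children I1:4, I3:2, and I4:1, plus a separate I1:2 branch with child I3:2.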
Mining the FP-tree by creating conditional (sub)pattern bases:

Item   Conditional Pattern Base          Conditional FP-tree        Frequent Patterns Generated
I5     {(I2 I1: 1), (I2 I1 I3: 1)}       <I2: 2, I1: 2>             I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
I4     {(I2 I1: 1), (I2: 1)}             <I2: 2>                    I2 I4: 2
I3     {(I2 I1: 2), (I2: 2), (I1: 2)}    <I2: 4, I1: 2>, <I1: 2>    I2 I3: 4, I1 I3: 4, I2 I1 I3: 2
I1     {(I2: 4)}                         <I2: 4>                    I2 I1: 4
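As a simplified, self-contained sketch of this mining step for one item (I5), the code below derives the prefix paths directly from the support-ordered transactions instead of walking node-links in a real tree; this flat enumeration suffices for the single-item example here, whereas the full FP-Growth algorithm recurses on conditional FP-trees. Names such as conditional_patterns are illustrative.

from collections import Counter
from itertools import combinations

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
min_sup_count = 2
support = Counter(item for t in D for item in t)

def conditional_patterns(item):
    """Frequent patterns ending in `item`, found via its conditional pattern base."""
    # Conditional pattern base: support-ordered prefixes of `item`, with counts.
    base = Counter()
    for t in D:
        if item in t:
            ordered = sorted((i for i in t if support[i] >= min_sup_count),
                             key=lambda i: (-support[i], i))
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    # Items that stay frequent inside the base form the conditional FP-tree.
    cond_counts = Counter()
    for prefix, n in base.items():
        for i in prefix:
            cond_counts[i] += n
    cond_items = [i for i, c in cond_counts.items() if c >= min_sup_count]
    # Each combination of those items, suffixed with `item`, is a frequent pattern.
    patterns = {}
    for r in range(1, len(cond_items) + 1):
        for combo in combinations(cond_items, r):
            count = sum(n for p, n in base.items() if set(combo) <= set(p))
            patterns[tuple(sorted(combo)) + (item,)] = count
    return dict(base), patterns

base, patterns = conditional_patterns("I5")
print(base)      # {('I2', 'I1'): 1, ('I2', 'I1', 'I3'): 1}
print(patterns)  # {('I2', 'I5'): 2, ('I1', 'I5'): 2, ('I1', 'I2', 'I5'): 2}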
From Association Analysis to Correlation Analysis
Correlation Concepts
Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) iff
P(A ∪ B) = P(A) P(B)
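The condition can be checked numerically on the example database by looking at the ratio P(A ∪ B) / (P(A) P(B)), commonly called lift; the snippet below is an illustrative sketch, with values near 1 indicating independence, above 1 positive correlation, and below 1 negative correlation.

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

def p(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in D if itemset <= t) / len(D)

A, B = {"I1"}, {"I2"}
lift = p(A | B) / (p(A) * p(B))
print(round(lift, 2))   # (4/9) / ((6/9) * (7/9)) ~= 0.86 -> slight negative correlation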
Correlation Rules
A correlation rule is a set of items {i1, i2, ..., in} whose occurrences are correlated.
The correlation value is given by the correlation formula, and we use the chi-square (χ²) test to determine whether the correlation is statistically significant. The χ² test can also detect negative correlation. We can also form minimal correlated itemsets, etc.
Limitations: the χ² test is less accurate on sparse data tables and can be misleading for contingency tables larger than 2x2.
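Here is a small, self-contained sketch of that test for a pair of items, using the 2x2 contingency table built from the example database (pure Python, no statistics library; 3.84 is the 95% critical value of χ² with one degree of freedom):

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]

def chi_square(a, b):
    """Chi-square statistic for the 2x2 contingency table of items a and b."""
    n = len(D)
    observed = [[0, 0], [0, 0]]          # rows: a present/absent, cols: b present/absent
    for t in D:
        observed[0 if a in t else 1][0 if b in t else 1] += 1
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

print(round(chi_square("I1", "I2"), 2))  # ~1.29 here: below 3.84, so not significant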
Summary
Association Rule Mining: finding interesting association or correlation relationships.
Questions?