Chapter 4: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods (Lecture 6-2)
Topics
Basic Concepts
Evaluation Methods
Summary
What Is Frequent Pattern Analysis?
Frequent pattern:
a pattern that occurs frequently in a data set, such as a set of items (a frequent itemset), a subsequence, or a substructure
Motivation:
Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
Can we automatically classify web documents?
Applications:
Basket data analysis, cross-marketing, sales campaign analysis, Web log (click-stream) analysis
Why Is Freq. Pattern Mining Important?
Frequent pattern mining has broad applications, from basket data analysis and cross-marketing to Web log mining, and it underpins further mining tasks such as association and correlation analysis.
Problem Definition:
The problem of association rule mining is defined as follows:
Let I = {i1, i2, ..., in} be a set of items and D a set of transactions, where each transaction T is a set of items such that T ⊆ I.
An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅.
The task is to find all rules whose support and confidence are no less than user-specified minimum thresholds (minsup, minconf).
Example 1:
The set of items is I = {milk, bread, butter, beer},
and a small database containing the items in transactions is shown in
the table on the next slide
(1 codes presence and 0 absence of an item in a transaction).
An example rule is {butter, bread} => {milk},
meaning that if butter and bread are bought, customers also buy milk.
Example:
Example database with 4 items and 5 transactions:

Transaction | milk | bread | butter | beer
1           |  1   |  1    |  0     |  0
2           |  0   |  0    |  1     |  0
3           |  0   |  0    |  0     |  1
4           |  1   |  1    |  1     |  0
5           |  0   |  1    |  0     |  0
Example:
Important concepts of Association Rule Mining. Three measures:
SUPPORT
CONFIDENCE
LIFT
support(item) = transactions containing the item / total transactions
The support, s, of a rule X => Y is the fraction of transactions containing both X and Y:
supp(X => Y) = supp(X ∪ Y) = |{T in D : X ∪ Y ⊆ T}| / |D|
Example:
The confidence, c, of a rule X => Y is
conf(X => Y) = supp(X ∪ Y) / supp(X).
It is calculated to determine whether products are popular as individual sales or only through combined sales.
For the rule {butter, bread} => {milk}:
Solution:
Note that for the itemset X ∪ Y = {butter, bread, milk},
supp(X ∪ Y) = 1/5 = 0.2,
and for the antecedent X = {butter, bread},
supp(X) = 1/5 = 0.2,
since it occurs in 20% of all transactions (1 out of 5 transactions).
Therefore conf(X => Y) = 0.2/0.2 = 1 = 100%.
Example:
In the example database above, find lift(X => Y) for another rule: {milk, bread} => {butter}.
The lift of a rule X => Y is
lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)).
Solution:
Note that for the itemset X ∪ Y = {milk, bread, butter},
supp(X ∪ Y) = 1/5 = 0.2.
The antecedent X = {milk, bread} occurs in 2 out of 5 transactions, so supp(X) = 0.4;
the item Y = {butter} also occurs in 2 out of 5 transactions, so supp(Y) = 0.4.
Therefore,
lift(X => Y) = 0.2 / (0.4 × 0.4) = 1.25.
A short computational sketch of all three measures follows.
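The following Python sketch (an illustration, not part of the original slides) re-derives the worked results above on the 4-item example database:

```python
# Minimal sketch: support, confidence, and lift over the 4-item example DB.
transactions = [
    {"milk", "bread"},            # 1 1 0 0
    {"butter"},                   # 0 0 1 0
    {"beer"},                     # 0 0 0 1
    {"milk", "bread", "butter"},  # 1 1 1 0
    {"bread"},                    # 0 1 0 0
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(X, Y, db):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    return support(X | Y, db) / support(X, db)

def lift(X, Y, db):
    """lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y))."""
    return support(X | Y, db) / (support(X, db) * support(Y, db))

print(confidence({"butter", "bread"}, {"milk"}, transactions))  # 1.0
print(lift({"milk", "bread"}, {"butter"}, transactions))        # 1.25
```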
Basic Concepts: Frequent Patterns

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

support count of X: frequency or occurrence of an itemset X
(relative) support, s: the fraction of transactions that contains X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
(Venn diagram: customers who buy beer, customers who buy diaper, customers who buy both.)
Example 2: Association Rules

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X => Y with minimum support and confidence.
Let minsup = 50%, minconf = 50%.
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
The association rule obtained:
Beer => Diaper (support 60%, confidence 100%)
Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Scalable Frequent Itemset Mining Methods
Two major approaches are covered: Apriori (candidate generation and test) and FPGrowth (pattern growth, without candidate generation).
The Downward Closure Property and Scalable
Mining Methods
The downward closure property of frequent patterns:
Any subset of a frequent itemset must be frequent.
If {beer, diaper, nuts} is frequent, so is {beer, diaper},
i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested.
Method:
Initially, scan the database once to get the frequent 1-itemsets.
Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
Test the candidates against the database.
Terminate when no frequent or candidate set can be generated.
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
L1 = {frequent 1-itemsets};
for (k = 1; Lk is not empty; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return the union of all Lk;
A runnable sketch of the algorithm follows.
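As an illustration only (my sketch, not the authors' code), here is a compact Python version of the pseudo-code. It uses a set-union join, which over-generates relative to the prefix-based join; the prune and count steps discard the extras:

```python
from itertools import combinations

def apriori(db, minsup_count):
    """Level-wise search: compute L1, then join/prune/count until Lk is empty."""
    db = [set(t) for t in db]
    items = {i for t in db for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= minsup_count}
    frequent, k = set(Lk), 1
    while Lk:
        # Join: unite pairs of frequent k-itemsets into (k+1)-candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: every k-subset of a surviving candidate must be in Lk
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count: scan the DB once per level to get candidate supports
        Lk = {c for c in Ck if sum(c <= t for t in db) >= minsup_count}
        frequent |= Lk
        k += 1
    return frequent

# The transactions of Example 2 with minsup = 50% (count >= 3):
db = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
print(apriori(db, 3))
# -> {Beer}, {Nuts}, {Diaper}, {Eggs}, and {Beer, Diaper}
```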
Example 3:
Consider the following transaction database D (the worked example referenced in the steps below), with minimum support count = 2:

TID  | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Steps:
1. In the first iteration, each item is a member of the set of candidate 1-itemsets, C1. The algorithm scans all of the transactions to count the number of occurrences of each item.
2. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
Example:
Steps:
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 x L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
Example:
Steps:
6. Generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 x L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent, so they are pruned, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}} (see the sketch after these steps).
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
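A small sketch (mine, not the slides') of the join and prune in step 6. The union-based join again over-generates, and the Apriori prune removes the extras:

```python
from itertools import combinations

# L2 for the worked example above (min_sup = 2)
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

# Join step: C3 = L2 x L2
C3 = {a | b for a in L2 for b in L2 if len(a | b) == 3}
# Prune step: drop candidates having an infrequent 2-subset
C3 = {c for c in C3 if all(frozenset(s) in L2 for s in combinations(c, 2))}
print(sorted(sorted(c) for c in C3))  # [['I1','I2','I3'], ['I1','I2','I5']]
```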
Scalable Frequent Itemset Mining Methods
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the
database.
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Consider the following data:
(Transaction table; the items include K, E, M, O, and Y.)
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
The frequency of each individual item is computed; the counts of the frequent items appear in the set L on the next slide.
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Let the minimum support be 3.
A Frequent Pattern set L is built, containing all the elements whose frequency is greater than or equal to the minimum support.
After insertion of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 4, Y : 3}
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Now, for each transaction, the respective Ordered-Item set is built: the items of the transaction that appear in L are listed in descending order of their frequencies.
An Ordered-Item set is built in this way for every transaction. A minimal sketch of the ordering step follows.
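A small Python sketch of the ordering step, using the set L above. The sample transaction is hypothetical (chosen over the same items); ties keep input order here, while a fixed tie-break such as alphabetical order is usual in practice:

```python
# Build the Ordered-Item set: keep only items that are in L,
# sorted by descending frequency.
L = {"K": 5, "E": 4, "M": 3, "O": 4, "Y": 3}

def ordered_item_set(transaction):
    return sorted((i for i in transaction if i in L), key=lambda i: -L[i])

# Hypothetical transaction containing an infrequent item N:
print(ordered_item_set(["E", "K", "M", "N", "O", "Y"]))
# -> ['K', 'E', 'O', 'M', 'Y']
```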
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, all the Ordered-Item sets are inserted into a Tree Data Structure (the frequent-pattern tree, or FP-tree), one transaction at a time: each Ordered-Item set is walked from the root, reusing existing nodes and incrementing their counts, and creating new child nodes where the path diverges. A minimal insertion sketch follows.
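A minimal sketch of the insertion step, representing the tree as nested dicts of the form {item: [count, children]}. The three Ordered-Item sets are hypothetical stand-ins over the same items:

```python
# Insert Ordered-Item sets into a prefix tree, bumping counts along the path.
tree = {}

def insert(tree, ordered_items):
    """Walk/extend the path for one transaction, incrementing node counts."""
    for item in ordered_items:
        node = tree.setdefault(item, [0, {}])  # [count, children]
        node[0] += 1
        tree = node[1]

for t in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"], ["K", "E", "M"]]:
    insert(tree, t)
print(tree)
# -> {'K': [3, {'E': [3, {'M': [2, {...}], 'O': [1, {...}]}]}]}
```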
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Now, for each item, the Conditional Pattern Base is computed: the path labels of all the paths which lead to any node of the given item in the frequent-pattern tree.
Note that the items in the conditional-pattern-base table are arranged in ascending order of their frequencies.
Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
Now, for each item, the Conditional Frequent Pattern Tree is built.
It is built by taking the set of elements that is common to all the paths in the Conditional Pattern Base of that item, and calculating its support count by summing the support counts of all the paths in the Conditional Pattern Base. A tiny sketch of this step follows.
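A tiny sketch of this step exactly as described above; the sample base (fca:2, fcab:1) is item m's conditional pattern base from the f-c-a-b-m-p example later in these notes:

```python
# Conditional FP-tree (simplified construction, as described): intersect all
# paths of the item's Conditional Pattern Base; the support count is the sum
# of the path counts.
cond_base = [({"f", "c", "a"}, 2), ({"f", "c", "a", "b"}, 1)]  # item m's base

common = set.intersection(*(path for path, _ in cond_base))
support_count = sum(count for _, count in cond_base)
print(sorted(common), support_count)  # -> ['a', 'c', 'f'] 3
```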
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by pairing the items of the Conditional Frequent Pattern Tree set with the corresponding suffix item.

The next example traces this on a second database. The first scan of the database is the same as in Apriori: it derives the set of frequent items (1-itemsets) and their support counts (frequencies).
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
TID  | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
We first consider I5, which is the last item in L, rather than the first.
The reason for starting at the end of the list will become apparent as we explain the FP-tree mining process.
I5 occurs in two branches of the FP-tree.
(The occurrences of I5 can easily be found by following its chain of node-links.)
The paths formed by these branches are <I2, I1, I5: 1> and <I2, I1, I3, I5: 1>.
Therefore, considering I5 as a suffix, its two corresponding prefix paths are <I2, I1: 1> and <I2, I1, I3: 1>, which form its conditional pattern base.
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach:
Breadth-first (i.e., level-wise) search
Candidate generation and test: often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00):
Depth-first search
Avoids explicit candidate generation
Major philosophy: grow long patterns from short ones using local frequent items only:
If "abc" is a frequent pattern, get all transactions having "abc", i.e., project the DB on abc: DB|abc.
If "d" is a local frequent item in DB|abc, then "abcd" is a frequent pattern.
Example:
Construct the FP-tree from the transaction database shown below (min_support = 3), with a summary of the mining of the FP-tree.

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o, w}       | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Header Table (each entry also heads the node-link chain for its item):
Item | frequency
f    | 4
c    | 4
a    | 3
b    | 3
m    | 3
p    | 3
Database and a summary of the mining of the FP-tree (min_support = 3):
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in frequency-descending order, giving the f-list: F-list = f-c-a-b-m-p.
3. Scan the DB again and construct the FP-tree by inserting each transaction's ordered frequent items.

The resulting FP-tree (item:count), with the header table linking the occurrences of each item:

{}
+- f:4
|  +- c:3
|  |  +- a:3
|  |     +- m:2
|  |     |  +- p:2
|  |     +- b:1
|  |        +- m:1
|  +- b:1
+- c:1
   +- b:1
      +- p:1
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
Patterns containing p
Patterns having m but no p
…
Pattern f
This partitioning is complete and non-redundant.
Find Patterns Having p From p's Conditional Database
Starting at the bottom of the frequent-item header table, traverse the FP-tree by following the node-links of each frequent item, and accumulate all of its transformed prefix paths to form its conditional pattern base (each base is a small database partition):

item | conditional pattern base
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1
Method:
For each frequent item, construct its conditional pattern base, and then its conditional FP-tree.
Repeat the process on each newly created conditional FP-tree
until the resulting FP-tree is empty, or it contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern.
A compact end-to-end sketch of this recursion follows.
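To tie the steps together, here is a compact, self-contained FP-growth sketch in Python (an illustration under my own naming, not the authors' code). Transactions carry a count so that conditional pattern bases can be mined recursively by the same function:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_tree(weighted_db, minsup):
    """Build an FP-tree from (items, count) pairs; return item counts and header."""
    freq = defaultdict(int)
    for items, w in weighted_db:
        for i in items:
            freq[i] += w
    freq = {i: c for i, c in freq.items() if c >= minsup}
    root, header = Node(None, None), defaultdict(list)
    for items, w in weighted_db:
        node = root
        # Keep frequent items only, in f-list (frequency-descending) order
        for i in sorted((i for i in items if i in freq),
                        key=lambda i: (-freq[i], i)):
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += w
    return freq, header

def fpgrowth(weighted_db, minsup, suffix=()):
    """Yield (pattern, support) pairs by recursive conditional-base mining."""
    freq, header = build_tree(weighted_db, minsup)
    for item, count in freq.items():
        pattern = (item,) + suffix
        yield pattern, count
        # Conditional pattern base: the prefix path of every `item` node
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, node.count))
        yield from fpgrowth(base, minsup, pattern)  # grow the pattern

# The f-c-a-b-m-p example with min_support = 3:
db = [["f","a","c","d","g","i","m","p"], ["a","b","c","f","l","m","o"],
      ["b","f","h","j","o","w"], ["b","c","k","s","p"],
      ["a","f","c","e","l","p","m","n"]]
for pattern, count in sorted(fpgrowth([(t, 1) for t in db], 3)):
    print("".join(sorted(pattern)), count)  # 18 patterns, e.g. fcam 3, cp 3
```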
Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
Basic Concepts
Evaluation Methods
Summary
Interestingness Measure: Correlations (Lift)
We could easily end up with thousands or even millions of patterns, many
of which might not be interesting.
Interestingness Measure: Correlations (Lift)
Consider the rule {tea} => {coffee}, whose confidence is 75%: at first glance, tea drinkers seem very likely to drink coffee.
This argument would have been acceptable, except that the fraction of people who drink coffee, regardless of whether they drink tea, is 80%, while the fraction of tea drinkers who drink coffee is only 75%; knowing that a person drinks tea actually lowers the probability that the person drinks coffee. Confidence alone is therefore misleading here.
Interestingness Measure: Correlations (Lift)
Lift:
play basketball => eat cereal [40%, 66.7%] is misleading:
the overall percentage of students eating cereal is 75% > 66.7%.
play basketball => not eat cereal [20%, 33.3%] is more accurate,
although with lower support and confidence.
Measure of dependent/correlated events: lift.
lift(A => B) = supp(A ∪ B) / (supp(A) × supp(B)); lift > 1 indicates positive correlation, lift < 1 negative correlation, and lift = 1 independence.
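A quick numeric check of the example (the 60% basketball support follows from the stated 40% support and 66.7% confidence):

```python
# lift(play basketball => eat cereal) with the supports from the slide
s_b, s_c, s_bc = 0.60, 0.75, 0.40    # basketball, cereal, both
print(round(s_bc / (s_b * s_c), 2))  # -> 0.89 < 1: negatively correlated
```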
Interestingness Measure: Kulczynski
The Kulczynski measure of two itemsets A and B is the average of the two rule confidences:
Kulc(A, B) = (1/2) (conf(A => B) + conf(B => A)) = (1/2) (supp(A ∪ B)/supp(A) + supp(A ∪ B)/supp(B))
Interestingness Measure: IR (Imbalance Ratio)
IR (Imbalance Ratio) measures the imbalance of two itemsets A and B in rule implications:
IR(A, B) = |supp(A) − supp(B)| / (supp(A) + supp(B) − supp(A ∪ B))
IR = 0 when the two directional implications are balanced; the closer IR is to 1, the more imbalanced the rule is.
EXAMPLE:
The Imbalance Ratio presents a clear picture for all three datasets D4 through D6:
D4 is balanced,
D5 is imbalanced,
D6 is very imbalanced.
A short sketch of both measures follows.
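A short sketch of both measures from the definitions above. The numeric cases are hypothetical supports chosen to mimic a balanced and a strongly skewed pair (the original D4-D6 tables are not reproduced):

```python
def kulc(s_a, s_b, s_ab):
    """Kulc(A,B): average of conf(A => B) and conf(B => A)."""
    return 0.5 * (s_ab / s_a + s_ab / s_b)

def ir(s_a, s_b, s_ab):
    """IR(A,B) = |supp(A) - supp(B)| / (supp(A) + supp(B) - supp(A u B))."""
    return abs(s_a - s_b) / (s_a + s_b - s_ab)

print(ir(0.4, 0.4, 0.2))             # -> 0.0, balanced (as in D4)
print(round(ir(0.9, 0.1, 0.05), 2))  # -> 0.84, very imbalanced (as in D6)
```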