Data Mining 1
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 2
Frequent Pattern
• Frequent Pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set.
• For example, a set of items, such as milk and bread, that appear frequently together in
a transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a digital camera, and then a memory
card, if it occurs frequently in a shopping history database, is a (frequent) sequential
pattern.
• A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure
occurs frequently, it is called a (frequent) structured pattern.
Data Mining 3
Frequent Pattern
Market Basket Analysis
• Frequent Pattern: a pattern that occurs frequently in a data set.
– A set of items that appears frequently together in a transaction data set is called a
frequent itemset.
• An example of frequent itemset mining is market basket analysis.
– This process analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”.
– If we think of the universe as the set of items available at the store, then each item has a
Boolean variable representing the presence or absence of that item.
– Each basket can then be represented by a Boolean vector of values assigned to these
variables.
– The Boolean vectors can be analyzed for buying patterns that reflect items that are
frequently associated or purchased together.
– These patterns can be represented in the form of association rules.
Data Mining 4
Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

• itemset: A set of one or more items
 – k-itemset X = {x1, …, xk}
• (absolute) support of X: frequency (count) of an itemset X
 – Absolute support of {Beer} is 3
• (relative) support of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 – Relative support of {Beer} is 3/5
• An itemset X is frequent if X’s support is no less than a minsup threshold.

[Figure: diagram of customers who buy beer, customers who buy diapers, and customers who buy both.]
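As a small illustration of these definitions, here is a minimal Python sketch (the helper names are made up for this example) that computes absolute and relative support over the five transactions above:

```python
# Five example transactions from the table above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def absolute_support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def relative_support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return absolute_support(itemset, transactions) / len(transactions)

print(absolute_support({"Beer"}, transactions))            # 3
print(relative_support({"Beer"}, transactions))            # 0.6
print(relative_support({"Beer", "Diaper"}, transactions))  # 0.6
```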
Data Mining 5
Basic Concepts: Association Rules
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
Data Mining 6
Basic Concepts: Association Rules
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Rule evaluation metrics:
 • Support s(X → Y) = σ(X ∪ Y) / N: the fraction of transactions that contain both X and Y
 • Confidence c(X → Y) = σ(X ∪ Y) / σ(X): how often items in Y appear in transactions that contain X
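Worked example, using the transaction table from the “Basic Concepts: Frequent Patterns” slide: for the rule {Diaper} → {Beer}, σ({Beer, Diaper}) = 3 and σ({Diaper}) = 4, so its support is 3/5 = 0.6 and its confidence is 3/4 = 0.75.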
Data Mining 7
Why Use Support and Confidence?
• Support is an important measure because a rule that has very low support may occur
simply by chance.
 – A low-support rule may also be uninteresting from a business perspective because it may not be
profitable to promote items that customers seldom buy together
 – For these reasons, support is often used to eliminate uninteresting rules
• Confidence measures the reliability of the inference made by a rule: for a rule X → Y, the higher
the confidence, the more likely it is for Y to be present in transactions that contain X
Data Mining 8
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules
having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Data Mining 9
Mining Association Rules
Observations:
• Rules generated from the same itemset {Milk, Diaper, Beer}, e.g., {Milk, Diaper} → {Beer},
{Milk, Beer} → {Diaper}, {Diaper, Beer} → {Milk}, {Beer} → {Milk, Diaper}, …, are all binary
partitions of that itemset
• Rules originating from the same itemset have identical support but can have different
confidence
• Thus, we may decouple the support and confidence requirements
Data Mining 10
Association Rule Mining
• The problem of mining association rules can therefore be reduced to a two-step process:
 – Frequent itemset generation: find all itemsets whose support ≥ minsup
 – Rule generation: from each frequent itemset, generate the high-confidence rules
   (confidence ≥ minconf), each of which is a binary partitioning of the itemset
Data Mining 11
Association Rules - Example
Data Mining 12
Association Rules - Example
Data Mining 13
Frequent Itemset Generation
An itemset lattice
Data Mining 14
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
Transactions (N) matched against the list of candidates (M):

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

– Each of the N transactions is matched against every one of the M candidate itemsets; with
maximum transaction width w, this takes on the order of N × M × w comparisons.

Total number of possible association rules that can be extracted from a data set with d items:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1
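For example, with d = 6 items the formula gives R = 3^6 − 2^7 + 1 = 729 − 128 + 1 = 602 possible rules.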
Data Mining 16
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
– The Apriori principle is an effective way to eliminate some of the candidate
itemsets without counting their support values.
Data Mining 17
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 18
Reducing Number of Candidates
Apriori Principle
• Apriori Principle: If an itemset is frequent, then all of its subsets
must also be frequent.
• Apriori principle holds due to the following property of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
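As a quick check against the transaction table on the “Basic Concepts: Frequent Patterns” slide: s({Beer, Diaper}) = 3/5, while its subsets satisfy s({Beer}) = 3/5 and s({Diaper}) = 4/5, both at least 3/5, exactly as the anti-monotone property requires.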
Data Mining 19
Illustrating Apriori Principle
[Figure: itemset lattice over the items {A, B, C, D, E}, from the null set down to ABCDE. Once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned from the search space.]
Data Mining 20
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1
Data Mining 21
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Minimum Support = 3
Eliminate infrequent 1-itemset candidates

If every subset is considered:
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
  6 + 6 + 4 = 16 candidates
Data Mining 22
Illustrating Apriori Principle
Minimum Support = 3
Generate 2-itemset candidates
If every subset is considered:
  C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
  6 + 6 + 4 = 16 candidates
Data Mining 23
Illustrating Apriori Principle
Data Mining 24
Illustrating Apriori Principle
Data Mining 27
Apriori Algorithm:
Candidate Generation: F(k−1) × F(k−1) Method
• Merge two frequent (k-1)-itemsets if their first (k-2) items are identical
• F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
Data Mining 28
Apriori Algorithm:
Candidate Pruning
• Let F3 = {ABC,ABD,ABE,ACD,BCD,BDE,CDE} be the set of frequent 3-itemsets
• Candidate pruning
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
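The following minimal Python sketch implements the F(k−1) × F(k−1) merge and the subset-based pruning described on this and the previous slide (the function names are illustrative, not from any library), reproducing the ABCD / ABCE / ABDE example:

```python
from itertools import combinations

def generate_candidates(freq_prev):
    """F(k-1) x F(k-1) merge: join two frequent (k-1)-itemsets whose
    first k-2 items (in sorted order) are identical."""
    freq_prev = [tuple(sorted(x)) for x in freq_prev]
    candidates = set()
    for a, b in combinations(sorted(freq_prev), 2):
        if a[:-1] == b[:-1]:                       # first k-2 items identical
            candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

def prune(candidates, freq_prev):
    """Apriori pruning: drop a candidate if any of its (k-1)-subsets is not frequent."""
    freq_prev = {tuple(sorted(x)) for x in freq_prev}
    return {c for c in candidates
            if all(s in freq_prev for s in combinations(c, len(c) - 1))}

F3 = [("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")]
C4 = generate_candidates(F3)   # {ABCD, ABCE, ABDE}
print(prune(C4, F3))           # {('A', 'B', 'C', 'D')}: ABCE and ABDE are pruned
```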
Data Mining 29
Apriori Algorithm:
Support Counting of Candidate Itemsets
• Scan the database of transactions to determine the support of each candidate itemset
– Must match every candidate itemset against every transaction, which is an
expensive operation
Data Mining 30
Apriori Algorithm
Data Mining 31
Apriori Algorithm
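As a stand-in for the full Apriori pseudocode, here is a minimal, unoptimized Python sketch of the level-wise loop (it uses a generic join of frequent (k−1)-itemsets plus Apriori pruning rather than the exact F(k−1) × F(k−1) merge, and brute-force support counting):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise frequent itemset mining; returns {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items
                if support_count(frozenset([i])) >= minsup_count]
    result = {fs: support_count(fs) for fs in frequent}
    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: unions of two frequent (k-1)-itemsets of size k ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... followed by Apriori pruning of candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Support counting pass over the database.
        frequent = [c for c in candidates if support_count(c) >= minsup_count]
        result.update({c: support_count(c) for c in frequent})
        k += 1
    return result

# Example: the 5-transaction data set used on the earlier slides, minsup count = 3.
T = [{"Bread", "Milk"}, {"Beer", "Bread", "Diaper", "Eggs"},
     {"Beer", "Coke", "Diaper", "Milk"}, {"Beer", "Bread", "Diaper", "Milk"},
     {"Bread", "Coke", "Diaper", "Milk"}]
print(apriori(T, 3))
```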
Data Mining 32
Apriori Algorithm - An Example
Data Mining 33
Support Counting Using Hash Tree
• Why is counting the supports of candidates a problem?
– The total number of candidates can be very huge
– One transaction may contain many candidates
– Must match every candidate itemset against every transaction, which is an
expensive operation
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a transaction
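The hash tree itself takes a bit of code; as a simplified stand-in, the sketch below shows what the subset function has to compute, by enumerating the 3-item subsets of a transaction and matching them against the candidate set with plain set lookups (the 15 candidates and the transaction {1 2 3 5 6} are the ones used in the example on the following slides):

```python
from itertools import combinations

def count_in_transaction(transaction, candidates, counts, k=3):
    """Increment the count of every candidate k-itemset contained in `transaction`.
    A hash tree avoids matching all subsets against all candidates; this
    brute-force version only illustrates what must be computed."""
    for subset in combinations(sorted(transaction), k):
        if subset in candidates:
            counts[subset] = counts.get(subset, 0) + 1

candidates = {(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)}
counts = {}
count_in_transaction({1, 2, 3, 5, 6}, candidates, counts)
print(counts)   # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}
```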
Data Mining 34
Support Counting Using Hash Tree
Subset Operation
• Enumerating subsets of three items from a transaction t
Data Mining 35
Support Counting Using Hash Tree
Generate Candidate Hash Tree
• Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7},
{3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
Data Mining 36
Support Counting Using Hash Tree
Generate Candidate Hash Tree
[Figure: candidate hash tree for the 15 itemsets above. The hash function sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch at every level; each leaf node stores the candidate 3-itemsets that hash to it.]
Data Mining 37
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts
[Figure: matching the transaction {1 2 3 5 6} against the candidate hash tree. At the root level, the possible first items of a 3-subset are enumerated, splitting the transaction into 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}; each prefix is hashed (1 → left, 2 → middle, 3 → right branch) to choose which subtree to follow.]
Data Mining 38
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts
[Figure: continuing the traversal one level down, the prefix 1 + {2 3 5 6} is expanded into 1 2 + {3 5 6}, 1 3 + {5 6}, and 1 5 + {6}, and each is hashed on its second item to select the next branch of the hash tree.]
Data Mining 39
Support Counting Using Hash Tree
Traverse Candidate Hash Tree to Update Support Counts
[Figure: the complete traversal of the candidate hash tree for the transaction {1 2 3 5 6}; only the candidates in the leaves reached by the traversal have to be compared against the transaction.]
Match transaction against 9 out of 15 candidates
Data Mining 40
Factors Affecting Complexity of Apriori Algorithm
• Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent itemsets
• Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O costs may also
increase
• Size of database
– since Apriori makes multiple passes, run time of algorithm may increase with number of
transactions
• Average transaction width
– transaction width increases with denser data sets
– this may increase the max length of frequent itemsets; in addition, the number of subsets
contained in a transaction grows with its width, making support counting more expensive
Data Mining 41
Effect of Support Threshold
• Effect of support threshold on the number of candidate and frequent itemsets
Data Mining 42
Effect of Average Transaction Width
• Effect of average transaction width on the number of candidate and frequent itemsets
Data Mining 43
Effect of Support Distribution
• How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets involving interesting rare items
(e.g., expensive products)
– If minsup is set too low, mining becomes computationally expensive and the number of
frequent itemsets becomes very large
Data Mining 44
Multiple Minimum Support
• How to apply multiple minimum supports?
– MS(i): minimum support for item i
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%
Data Mining 45
Multiple Minimum Support
• Order the items according to their minimum support (in ascending order)
– e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
– Ordering: Broccoli, Salmon, Coke, Milk
Data Mining 46
Multiple Minimum Support
• Modifications to Apriori:
– In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two
frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subsets of size k
Data Mining 47
Rule Generation in Apriori Algorithm
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the candidate
rule f → L − f satisfies the minimum confidence requirement
 – If {A,B,C,D} is a frequent itemset, the candidate rules are:
   ABC → D,  ABD → C,  ACD → B,  BCD → A,
   D → ABC,  C → ABD,  B → ACD,  A → BCD,
   AB → CD,  AC → BD,  AD → BC,
   CD → AB,  BD → AC,  BC → AD
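For a frequent k-itemset there are 2^k − 2 candidate rules (ignoring the rules with an empty antecedent or consequent), so {A,B,C,D} yields the 14 rules above. A minimal Python sketch of the enumeration, where the support dictionary is a hypothetical input assumed to come from the frequent itemset generation step:

```python
from itertools import combinations

def candidate_rules(frequent_itemset, support, minconf):
    """Generate rules f -> (L - f) for every non-empty proper subset f of L
    and keep those whose confidence meets minconf.
    `support` maps frozensets to support counts."""
    L = frozenset(frequent_itemset)
    rules = []
    for r in range(1, len(L)):
        for antecedent in combinations(sorted(L), r):
            f = frozenset(antecedent)
            confidence = support[L] / support[f]
            if confidence >= minconf:
                rules.append((set(f), set(L - f), confidence))
    return rules
```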
Data Mining 48
Rule Generation in Apriori Algorithm
• How to efficiently generate rules from frequent itemsets?
 – In general, confidence does not have an anti-monotone property: c(ABC → D) can be
larger or smaller than c(AB → D)
• But the confidence of rules generated from the same itemset does have an anti-monotone
property
 – e.g., suppose {A,B,C,D} is a frequent 4-itemset:
   c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
 – Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule
Data Mining 49
Rule Generation in Apriori Algorithm
Lattice of rules generated from the frequent itemset {A,B,C,D}, from ABCD ⇒ {} at the top
down through BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D, and so on.
[Figure: when a rule in the lattice is found to have low confidence, all rules below it (obtained by moving further items from the antecedent to the consequent) can be pruned.]
Data Mining 51
Compact Representation of Frequent Itemsets
• The number of frequent itemsets produced from a transaction data set can be very
large.
• Some produced itemsets can be redundant because they have the same support as their
supersets
• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived. Need a compact representation
– Maximal Frequent Itemsets and
– Closed Frequent Itemsets
Data Mining 52
Maximal Frequent Itemsets
Maximal Frequent Itemset: A maximal frequent itemset is defined as a frequent
itemset for which none of its immediate supersets are frequent.
Data Mining 53
Maximal Frequent Itemsets
All frequent itemsets can be derived from the maximal frequent itemsets
(in the lattice example: ad, ace, and bcde).
Every frequent itemset is a subset of some maximal frequent itemset.
Data Mining 54
Maximal Frequent Itemsets
• Despite providing a compact representation, maximal frequent itemsets do not contain
the support information of their subsets.
• For example, the supports of the maximal frequent itemsets {a, c, e}, {a, d}, and
{b,c,d,e} do not provide any hint about the supports of their subsets.
• An additional pass over the data set is therefore needed to determine the support
counts of the non-maximal frequent itemsets.
Data Mining 55
Closed Frequent Itemsets
Closed Itemset: An itemset X is closed if none of its immediate supersets has exactly
the same support count as X.
Data Mining 56
Closed Frequent Itemsets
All subsets of a closed frequent itemset are frequent, and their supports are greater
than or equal to the support of that closed frequent itemset.
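Given all frequent itemsets together with their support counts, the following brute-force Python sketch (illustrative only) flags which of them are closed and which are maximal by inspecting immediate supersets:

```python
def closed_and_maximal(freq_support):
    """freq_support maps each frequent itemset (frozenset) to its support count.
    Returns the sets of closed and maximal frequent itemsets."""
    items = set().union(*freq_support)   # all items appearing in frequent itemsets
    closed, maximal = set(), set()
    for X, s in freq_support.items():
        frequent_supersets = [X | {i} for i in items - X if (X | {i}) in freq_support]
        # Closed: no immediate superset has exactly the same support count.
        # (An infrequent superset cannot tie the support of a frequent X.)
        if all(freq_support[Y] != s for Y in frequent_supersets):
            closed.add(X)
        # Maximal: none of the immediate supersets is frequent at all.
        if not frequent_supersets:
            maximal.add(X)
    return closed, maximal
```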
Data Mining 57
Maximal vs Closed Itemsets
# Closed = 9
# Maximal = 4
Data Mining 58
Maximal vs Closed Itemsets
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
[Figure: nested sets showing that every maximal frequent itemset is closed, and every closed frequent itemset is frequent.]
Data Mining 59
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 60
FP-Growth (Frequent Pattern Growth) Algorithm
• The FP-growth algorithm takes a radically different approach to discovering
frequent itemsets.
– The algorithm does not subscribe to the generate-and-test paradigm of Apriori
• FP-growth algorithm encodes the data set using a compact data structure called an
FP-tree and extracts frequent itemsets directly from this structure.
– Use a compressed representation of the database using an FP-tree
– Once an FP-tree has been constructed, it uses a recursive divide-and-conquer
approach to mine the frequent itemsets
Data Mining 61
FP-Tree Construction
• An FP-tree is a compressed representation of the input data.
• It is constructed by reading the data set one transaction at a time and mapping each
transaction onto a path in the FP-tree.
– Because different transactions can have several items in common, their paths may overlap.
– The more the paths overlap with one another, the more compression we can
achieve using the FP-tree structure.
Data Mining 62
FP-Tree Construction
• Each node in the tree contains the label of an item along with a counter that shows the
number of transactions mapped onto the given path.
– Initially, the FP-tree contains only the root node represented by the null symbol.
– Every transaction maps onto one of the paths in the FP-tree.
• The size of an FP-tree is typically smaller than the size of the uncompressed data
because many transactions in market basket data often share a few items in common.
 – Best-case scenario: all transactions have the same set of items, and the
   FP-tree contains only a single path of nodes.
 – Worst-case scenario: every transaction has a unique set of items, and the
   FP-tree is effectively as large as the original data.
 – In addition, the physical storage requirement of the FP-tree is higher because it needs extra
   space to store pointers between nodes and counters for each item.
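A bare-bones Python sketch of the node structure and of inserting one transaction (items already sorted by decreasing support); the class and attribute names are illustrative, and the header table is kept as simple per-item node lists rather than linked pointers:

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item         # item label (None for the root)
        self.count = 0           # number of transactions mapped through this node
        self.parent = parent
        self.children = {}       # item -> FPNode

def insert_transaction(root, transaction, header):
    """Insert one support-ordered transaction into the FP-tree."""
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            header.setdefault(item, []).append(child)   # node-link for this item
        child.count += 1
        node = child

# The first three transactions of the example on the next slide:
root, header = FPNode(None), {}
for t in [["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"]]:
    insert_transaction(root, t, header)
print(root.children["A"].count)   # 2, as in "After reading TID=3"
```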
Data Mining 63
FP-Tree Construction
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

After reading TID=1:  null – A:1 – B:1
After reading TID=2:  a second branch null – B:1 – C:1 – D:1 is added
                      (the tree now holds the paths A:1 – B:1 and B:1 – C:1 – D:1 under the root)
After reading TID=3:  transaction {A,C,D,E} shares only item A with the existing paths, so the
                      count of node A becomes 2 and a new path A:2 – C:1 – D:1 – E:1 is created
Data Mining 66
Finding Frequent Itemsets Ending with e
1. The first step is to gather all the paths containing node e. These initial paths are called prefix
paths
2. From the prefix paths, the support count for e is obtained by adding the support counts
associated with node e. Assuming that the minimum support count is 2, {e} is declared a
frequent itemset because its support count is 3.
3. Because {e} is frequent, the algorithm has to solve the subproblems of finding frequent
itemsets ending in de, ce, be, and ae. Before solving these subproblems, it must first convert
the prefix paths into a conditional FP-tree, which is structurally similar to an FP-tree, except
it is used to find frequent itemsets ending with a particular suffix.
– First, the support counts along the prefix paths must be updated because some of the counts include
transactions that do not contain item e.
– The prefix paths are truncated by removing the nodes for e.
– After updating the support counts along the prefix paths, some of the items may no longer be frequent
• the node b appears only once and has a support count equal to 1, which means that there is only one transaction
that contains both b and e. Item b can be safely ignored from subsequent analysis because all itemsets ending in
be must be infrequent.
4. FP-growth uses the conditional FP-tree for e to solve the subproblems of finding frequent
itemsets ending in de, ce, and ae.
Data Mining 67
Prefix Paths Ending with e
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

FP-tree (the header table links, for each item A, B, C, D, E, all tree nodes labeled with that item):
null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Prefix paths ending with e (the paths of the FP-tree that contain node e):
null
  A:7
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      E:1
Data Mining 68
Conditional FP-Tree for e
(Transaction table as on the previous slide; minsup = 2.)

Prefix paths ending with e:
null
  A:7
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      E:1

To create the conditional FP-tree for e:
• Update the support counts along the prefix paths, because some of the counts include
  transactions that do not contain e
• e is frequent (support = 3); remove the e nodes from the prefix paths
• Remove nodes of items that are no longer frequent (b now has a count of 1)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1
Data Mining 69
Conditional FP-Tree for de
(Transaction table as before; minsup = 2.)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1

Prefix paths ending with de (within the conditional FP-tree for e):
null
  A:2
    C:1
      D:1
    D:1

de is frequent (support = 2)

Conditional FP-tree for de (d nodes removed; c is no longer frequent and is dropped):
null
  A:2
Data Mining 70
Conditional FP-Tree for ce
(Transaction table as before; minsup = 2.)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1

Prefix paths ending with ce:
null
  A:2
    C:1
  C:1

ce is frequent (support = 2)

Conditional FP-tree for ce (c nodes removed, counts updated):
null
  A:1
Data Mining 71
Conditional FP-Tree for ae
(Transaction table as before; minsup = 2.)

Conditional FP-tree for e:
null
  A:2
    C:1
      D:1
    D:1
  C:1

Prefix paths ending with ae:
null
  A:2

ae is frequent (support = 2)
Data Mining 72
Frequent Itemsets Ordered by Suffixes
TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

minsup = 2

Suffix   Frequent Itemsets
E        {E}, {D,E}, {A,D,E}, {C,E}, {A,E}
D        {D}, {C,D}, {B,C,D}, {A,C,D}, {B,D}, {A,B,D}, {A,D}
C        {C}, {B,C}, {A,B,C}, {A,C}
B        {B}, {A,B}
A        {A}

(FP-tree and header table as shown on the earlier “Prefix Paths Ending with e” slide.)
Data Mining 73
• Frequent Itemsets, Association Rules
• Apriori Algorithm
• Compact Representation of Frequent Itemsets
• FP-Growth Algorithm: An Alternative Frequent
Itemset Generation Algorithm
• Evaluation of Association Patterns
Data Mining 74
Evaluation of Association Patterns
• Association rule algorithms tend to produce too many rules
 – many of them are uninteresting or redundant
 – e.g., {A,B} → {D} is redundant if {A,B,C} → {D} has the same support & confidence
 – more generally, an association rule X → Y is redundant if there exists another rule X' → Y',
where X is a subset of X' and Y is a subset of Y', such that the support and
confidence for both rules are identical.
• In the original formulation of association rules, support & confidence are the only
measures used
Data Mining 75
Computing Interestingness Measure
• Given a rule X → Y, the information needed to compute rule interestingness can be
obtained from a contingency table
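For a rule X → Y over N transactions, the contingency table has the standard 2×2 form below, where f11 counts transactions containing both X and Y, f10 those containing X but not Y, f01 those containing Y but not X, and f00 those containing neither:

          Y      ~Y
X        f11    f10    f1+
~X       f01    f00    f0+
         f+1    f+0     N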
Data Mining 76
Drawback of Confidence
          Coffee   ~Coffee   Total
Tea         15        5        20
~Tea        75        5        80
Total       90       10       100
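Reading the numbers off the table: confidence(Tea → Coffee) = 15/20 = 0.75, which looks high, yet P(Coffee) = 90/100 = 0.9. Knowing that a person drinks tea actually lowers the probability that they drink coffee (from 0.9 to 0.75), so a high-confidence rule can still be misleading because confidence ignores the support of the rule's consequent.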
Data Mining 77
Measure for Association Rules
• So, what kind of rules do we really want?
– Confidence(X → Y) should be sufficiently high
 • to ensure that people who buy X are more likely to buy Y than not to buy Y
Data Mining 78
Statistical Independence
• Population of 1000 students
– 600 students know how to swim (S)
– 700 students know how to bike (B)
– 420 students know how to swim and bike (S,B)
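Completing the arithmetic: P(S ∧ B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42, so swimming and biking are statistically independent in this population; a joint probability above 0.42 would indicate a positive correlation, and one below 0.42 a negative correlation.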
Data Mining 79
Statistical-Based Measures for Interestingness
• Statistical-Based Measures use statistical dependence information.
• Two of them are Lift and Interest (they are equal).
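For a rule X → Y these measures are defined as Lift(X → Y) = c(X → Y) / s(Y) = s(X ∪ Y) / (s(X) × s(Y)), and Interest(X, Y) = P(X, Y) / (P(X) × P(Y)), which is the same quantity. A value of 1 indicates statistical independence, values above 1 a positive association, and values below 1 a negative association.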
Data Mining 80
Example: Lift/Interest
          Coffee   ~Coffee   Total
Tea         15        5        20
~Tea        75        5        80
Total       90       10       100
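Applying the definition to this table: Lift(Tea → Coffee) = 0.75 / 0.9 ≈ 0.83 < 1, so Tea and Coffee are negatively associated even though the confidence of the rule (0.75) looks high.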
Data Mining 81
Example: Lift/Interest
For a second data set of 5000 transactions containing items B and C:

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ~C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Data Mining 82
Limitations of Interest Factor
• We expect the words data and mining to appear together more frequently than the
words compiler and mining in a collection of computer science articles.
Data Mining 83
Different Measures
• There are lots of measures proposed in the literature
Data Mining 84
Properties of A Good Measure
3 properties a good measure M must satisfy:
 – M(A,B) = 0 if A and B are statistically independent
 – M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain
unchanged
 – M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or
P(A)] remain unchanged
Data Mining 85