Data Mining - Dr. Mahmoud Mounir Mahmoud - mahmoud.mounir@cis.asu.edu.eg
LECTURE 2
Dr. Mahmoud Mounir
mahmoud.mounir@cis.asu.edu.eg
Data Mining:
Concepts and Techniques
(3rd ed.)
— Chapter 6 —
◼ Basic Concepts
◼ Evaluation Methods
◼ Summary
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together? Juice and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
◼ Broad applications
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Tid | Items bought
10  | Juice, Nuts, Diaper
20  | Juice, Coffee, Diaper
30  | Juice, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction having X also contains Y

(Figure: Venn diagram of customers who buy juice, buy diapers, or buy both)

◼ Let minsup = 50%, minconf = 50%
◼ Freq. Pat.: {Juice}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Juice, Diaper}:3
◼ Association rules: (many more!)
◼ Juice → Diaper (60%, 100%)
◼ Diaper → Juice (60%, 75%)
Closed Patterns and Max-Patterns
◼ Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
◼ Min_sup = 1.
◼ What is the set of closed itemsets?
◼ <a1, …, a100>: 1
◼ <a1, …, a50>: 2
◼ What is the set of max-patterns?
◼ <a1, …, a100>: 1
◼ What is the set of all patterns?
◼ All nonempty subsets of <a1, …, a100>: 2^100 − 1 patterns, far too many to enumerate!
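Enumerating the 2^100 − 1 patterns of the exercise's DB is infeasible, which is exactly why closed and max patterns matter. The sketch below mirrors the exercise on a hypothetical scaled-down DB = {<a,b,c>, <a,b>} with min_sup = 1, where brute-force enumeration is possible:

```python
from itertools import combinations

# Scaled-down analogue of the exercise (hypothetical data, not the slide's DB).
db = [frozenset("abc"), frozenset("ab")]
min_sup = 1

items = sorted(set().union(*db))
sup = {}  # frequent itemset -> support count
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = frozenset(combo)
        count = sum(s <= t for t in db)
        if count >= min_sup:
            sup[s] = count

# Closed: no proper superset with the SAME support.
closed = {s for s in sup
          if not any(s < t and sup[t] == sup[s] for t in sup)}
# Max: no proper superset that is frequent at all.
maximal = {s for s in sup if not any(s < t for t in sup)}

print(sorted(map(sorted, closed)))   # [['a', 'b'], ['a', 'b', 'c']]
print(sorted(map(sorted, maximal)))  # [['a', 'b', 'c']]
print(len(sup))                      # 7 frequent patterns in total
```

Just as in the exercise, the closed set ({a,b}:2 and {a,b,c}:1) losslessly summarizes all 7 frequent patterns, while the single max-pattern {a,b,c} records only which patterns are frequent, not their supports.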
Computational Complexity of Frequent Itemset Mining
◼ How many itemsets may potentially be generated in the worst case?
◼ The number of frequent itemsets to be generated is sensitive to the minsup threshold
◼ When minsup is low, there exist potentially an exponential number of frequent itemsets
◼ The worst case: M^N where M: # distinct items, and N: max length of transactions
◼ The worst-case complexity vs. the expected probability
◼ Ex. Suppose Walmart has 10^4 kinds of products
◼ The chance to pick up one product: 10^-4
◼ The chance to pick up a particular set of 10 products: ~10^-40
◼ What is the chance this particular set of 10 products to be frequent 10^3 times in 10^9 transactions?
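The slide's back-of-the-envelope question can be answered numerically, assuming (as the slide implies) that products appear independently and uniformly at random:

```python
# Rough expected-count calculation for the Walmart example on the slide.
p_one_item = 10 ** -4            # chance a transaction contains one given product
p_set_of_10 = p_one_item ** 10   # ~10^-40 for a particular set of 10 products

n_transactions = 10 ** 9
expected_occurrences = n_transactions * p_set_of_10
print(expected_occurrences)      # ~1e-31
```

The expected number of transactions containing this particular 10-itemset is about 10^-31, so under these assumptions the chance of it appearing even once, let alone 10^3 times, is vanishingly small: the exponential worst case rarely materializes in practice.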
Chapter 6: Mining Frequent Patterns, Association
and Correlations: Basic Concepts and Methods
◼ Basic Concepts
◼ Evaluation Methods
◼ Summary
Scalable Frequent Itemset Mining Methods
◼ Apriori: A Candidate Generation-and-Test Approach
◼ FPGrowth: A Frequent Pattern-Growth Approach
◼ ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable
Mining Methods
◼ The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ If {juice, diaper, nuts} is frequent, so is {juice, diaper}
◼ i.e., every transaction having {juice, diaper, nuts} also contains {juice, diaper}
◼ Scalable mining methods: three major approaches
◼ Apriori (Agrawal & Srikant @VLDB'94)
◼ Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD'00)
◼ Vertical data format approach (Charm: Zaki & Hsiao @SDM'02)
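The downward closure property is what lets Apriori discard candidates before ever counting their support. A minimal sketch of that pruning step (the itemsets below are hypothetical, chosen only to illustrate the idea):

```python
from itertools import combinations

def prune(candidates, frequent_prev):
    """Apriori pruning: a size-k candidate survives only if ALL of its
    (k-1)-subsets are already known frequent (downward closure)."""
    return [c for c in candidates
            if all(frozenset(s) in frequent_prev
                   for s in combinations(c, len(c) - 1))]

# Hypothetical frequent 2-itemsets: note {juice, nuts} is NOT among them.
frequent_2 = {frozenset(p) for p in
              [("juice", "diaper"), ("juice", "coffee"),
               ("diaper", "coffee"), ("diaper", "nuts")]}

cands = [frozenset(("juice", "diaper", "nuts")),     # pruned: {juice, nuts} infrequent
         frozenset(("juice", "diaper", "coffee"))]   # kept: all 2-subsets frequent
print(prune(cands, frequent_2))
```

Only the second candidate survives, so one fewer itemset needs a database scan, and on real data this pruning typically eliminates the vast majority of candidates.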
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example
Sup_min = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1:
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{D}     | 1
{E}     | 3

L1 (drop itemsets with sup < 2):
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{E}     | 3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset | sup
{A, B}  | 1
{A, C}  | 2
{A, E}  | 1
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

L2 (drop itemsets with sup < 2):
Itemset | sup
{A, C}  | 2
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
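The pseudocode above can be sketched as a runnable function. This is a minimal illustration on the slide's TDB, not a tuned implementation:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori following the slide's pseudocode.
    Returns a dict mapping each frequent itemset to its support count."""
    db = [frozenset(t) for t in db]
    # L1: frequent 1-itemsets
    counts = {}
    for t in db:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(L)
    k = 1
    while L:
        # Candidate generation: join Lk with itself, keep size-(k+1) unions
        prev = list(L)
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates having an infrequent k-subset (downward closure)
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        # Count supports of the surviving candidates in one database scan
        counts = {c: sum(c <= t for t in db) for c in cands}
        L = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(L)
        k += 1
    return result

# The TDB from the example slide, with min support = 2
db = [("A", "C", "D"), ("B", "C", "E"), ("A", "B", "C", "E"), ("B", "E")]
freq = apriori(db, min_sup=2)
print(freq[frozenset("BCE")])  # 2: {B, C, E} is frequent
```

Running it reproduces the example: L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, and one further pass yields the frequent 3-itemset {B, C, E} with support 2, while {D} and {A, B} are correctly pruned.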
Implementation of Apriori
The Apriori Algorithm—An Example
Min Support = 2
Confidence = 70%
Summary
Ref: Basic Concepts of Frequent Pattern Mining
Ref: Apriori and Its Improvements
◼ R. Agrawal and R. Srikant. Fast algorithms for mining association rules.
VLDB'94
◼ H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94
◼ A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining
association rules in large databases. VLDB'95
◼ J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for
mining association rules. SIGMOD'95
◼ H. Toivonen. Sampling large databases for association rules. VLDB'96
◼ S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and
implication rules for market basket analysis. SIGMOD'97
◼ S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining
with relational database systems: Alternatives and implications. SIGMOD'98
Ref: Depth-First, Projection-Based FP Mining
◼ R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation
of frequent itemsets. J. Parallel and Distributed Computing, 2002.
◼ G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
FIMI'03
◼ B. Goethals and M. Zaki. An introduction to workshop on frequent itemset mining
implementations. Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining
Implementations (FIMI’03), Melbourne, FL, Nov. 2003
◼ J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00
◼ J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
Projection. KDD'02
◼ J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without
Minimum Support. ICDM'02
◼ J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining
Frequent Closed Itemsets. KDD'03
Ref: Vertical Format and Row Enumeration Methods
Ref: Mining Correlations and Interesting Rules
◼ S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97.
◼ M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94.
◼ R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and Measures of Interest.
Kluwer Academic, 2001.
◼ C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining
causal structures. VLDB'98.
◼ P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure
for Association Patterns. KDD'02.
◼ E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03.
◼ T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework", Data Mining and Knowledge Discovery, 21(3):371-
397, 2010