P8 FPBasic
Slides adapted from Jiawei Han, Computer Science, Univ. of Illinois at Urbana-Champaign, 2017
1
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
2
Pattern Discovery: Basic Concepts
3
What Is Pattern Discovery?
q What are patterns?
q Patterns: A set of items, subsequences, or substructures that occur
frequently together (or strongly correlated) in a data set
q Patterns represent intrinsic and important properties of datasets
q Pattern discovery: Uncovering patterns from massive data sets
q Motivation examples:
q What products were often purchased together?
q What are the subsequent purchases after buying an iPad?
q What code segments likely contain copy-and-paste bugs?
q What word sequences likely form phrases in this corpus?
4
Pattern Discovery: Why Is It Important?
q Finding inherent regularities in a data set
q Foundation for many essential data mining tasks
q Association, correlation, and causality analysis
q Mining sequential, structural (e.g., sub-graph) patterns
q Pattern analysis in spatiotemporal, multimedia, time-series, and
stream data
q Classification: Discriminative pattern-based analysis
q Cluster analysis: Pattern-based subspace clustering
q Broad applications
q Market basket analysis, cross-marketing, catalog design, sales campaign
analysis, Web log analysis, biological sequence analysis
5
Basic Concepts: k-Itemsets and Their Supports
q Itemset: A set of one or more items
q k-itemset: X = {x1, …, xk}
q Ex. {Beer, Nuts, Diaper} is a 3-itemset
q (absolute) support (count) of X, sup{X}: Frequency or the number of occurrences of an itemset X
q Ex. sup{Beer} = 3
q Ex. sup{Diaper} = 4
q Ex. sup{Beer, Diaper} = 3
q Ex. sup{Beer, Eggs} = 1
q (relative) support, s{X}: The fraction of transactions that contain X (i.e., the probability that a transaction contains X)
q Ex. s{Beer} = 3/5 = 60%
q Ex. s{Diaper} = 4/5 = 80%
q Ex. s{Beer, Eggs} = 1/5 = 20%

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk
6
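To make the support definitions concrete, here is a minimal Python sketch (an illustration, not part of the original slides; the helper names abs_support and rel_support are assumptions) that recomputes the examples above from the five-transaction DB:

# Minimal sketch: absolute and relative support of an itemset,
# using the 5-transaction example DB from this slide.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def abs_support(itemset, db):
    """Absolute support: number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)

def rel_support(itemset, db):
    """Relative support: fraction of transactions containing `itemset`."""
    return abs_support(itemset, db) / len(db)

print(abs_support({"Beer", "Diaper"}, transactions))   # 3
print(rel_support({"Beer", "Eggs"}, transactions))      # 0.2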
Basic Concepts: Frequent Itemsets (Patterns)
q An itemset (or a pattern) X is frequent if the support of X is no less than a minsup threshold σ
q Ex. For the transaction DB above, let minsup = 50% (i.e., an absolute support of 3):
q Frequent 1-itemsets: {Beer}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3
q Frequent 2-itemsets: {Beer, Diaper}:3
7
From Frequent Itemsets to Association Rules
q Comparing with itemsets, rules can be more telling
q Ex. Diaper → Beer
q Buying diapers may likely lead to buying beers
q How strong is this rule? (support, confidence)
q Measuring association rules: X → Y (s, c)
q Both X and Y are itemsets
q Support, s: The probability that a transaction contains X ∪ Y
q Ex. s{Diaper, Beer} = 3/5 = 0.6 (i.e., 60%)
q Confidence, c: The conditional probability that a transaction containing X also contains Y, i.e., c = sup{X ∪ Y} / sup{X}
q Ex. c(Diaper → Beer) = sup{Diaper, Beer} / sup{Diaper} = 3/4 = 75%

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

(Venn diagram: transactions containing beer, transactions containing diaper, and their overlap containing both, i.e., {Beer} ∪ {Diaper} = {Beer, Diaper})
10
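A small follow-up sketch, reusing the transactions list and the abs_support/rel_support helpers from the previous sketch, computes (s, c) for a rule X → Y; the helper name rule_strength is an assumption:

# Minimal sketch: support and confidence of a rule X -> Y,
# reusing `transactions`, abs_support and rel_support defined above.
def rule_strength(X, Y, db):
    """Return (support, confidence) for the rule X -> Y."""
    s = rel_support(set(X) | set(Y), db)                       # s = P(X ∪ Y)
    c = abs_support(set(X) | set(Y), db) / abs_support(X, db)  # c = sup(X ∪ Y) / sup(X)
    return s, c

print(rule_strength({"Diaper"}, {"Beer"}, transactions))  # (0.6, 0.75)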
Expressing Patterns in Compressed Form: Closed Patterns
q How to handle the huge number of frequent patterns that mining can generate?
q Solution 1: Closed patterns: A pattern (itemset) X is closed if X is frequent, and there exists no super-pattern Y ⊃ X with the same support as X
q Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
q Suppose minsup = 1. How many closed patterns does TDB1 contain?
q Two: P1: “{a1, …, a50}: 2”; P2: “{a1, …, a100}: 1”
q Closed patterns are a lossless compression of frequent patterns
q Reduces the # of patterns but does not lose the support information!
q You will still be able to say: “{a2, …, a40}: 2”, “{a5, a51}: 1”
11
Expressing Patterns in Compressed Form: Max-Patterns
q Solution 2: Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
q Difference from closed patterns? We do not care about the real support of the sub-patterns of a max-pattern
q Let Transaction DB TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
q Suppose minsup = 1. How many max-patterns does TDB1 contain?
q One: P: “{a1, …, a100}: 1”
q Max-patterns are a lossy compression!
q We only know {a1, …, a40} is frequent
q But we no longer know the real support of {a1, …, a40}, …
q Thus in many applications, mining closed patterns is more desirable than mining max-patterns
12
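Both compressed forms can be checked mechanically. Here is a minimal Python sketch, assuming the frequent itemsets and their supports have already been mined into a dictionary (the TDB1 example above is too large to enumerate, so a two-transaction toy DB is used instead):

# Minimal sketch: identifying closed and max patterns among a given
# set of frequent itemsets (mapping frozenset -> absolute support).
def closed_patterns(freq):
    # Closed: no proper super-pattern with the same support
    return {X: s for X, s in freq.items()
            if not any(X < Y and s == sY for Y, sY in freq.items())}

def max_patterns(freq):
    # Max: no proper super-pattern that is frequent at all
    return {X: s for X, s in freq.items()
            if not any(X < Y for Y in freq)}

# Toy example (minsup = 1): frequent itemsets of TDB = {T1: {a, b}, T2: {a, b, c}}
freq = {
    frozenset("a"): 2, frozenset("b"): 2, frozenset("ab"): 2,
    frozenset("c"): 1, frozenset("ac"): 1, frozenset("bc"): 1, frozenset("abc"): 1,
}
print(closed_patterns(freq))  # {a, b}: 2 and {a, b, c}: 1
print(max_patterns(freq))     # {a, b, c}: 1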
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
13
Efficient Pattern Mining Methods
q The Downward Closure Property of Frequent Patterns
14
The Downward Closure Property of Frequent Patterns
q Observation: From TDB1: T1: {a1, …, a50}; T2: {a1, …, a100}
q We get a frequent itemset: {a1, …, a50}
q Also, its subsets are all frequent: {a1}, {a2}, …, {a50}, {a1, a2}, …, {a1, …, a49}, …
q There must be some hidden relationships among frequent patterns!
q The downward closure (also called “Apriori”) property of frequent patterns
q If {beer, diaper, nuts} is frequent, so is {beer, diaper}
q Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
q Apriori: Any subset of a frequent itemset must be frequent
q Efficient mining methodology
q If any subset of an itemset S is infrequent, then there is no chance for S to
be frequent—why do we even have to consider S!? A sharp knife for pruning!
15
Apriori Pruning and Scalable Mining Methods
q Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not even be generated! (Agrawal &
Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
q Scalable mining Methods: Three major approaches
q Level-wise, join-based approach: Apriori (Agrawal &
Srikant@VLDB’94)
q Vertical data format approach: Eclat (Zaki, Parthasarathy,
Ogihara, Li @KDD’97)
q Frequent pattern projection and growth: FPgrowth (Han, Pei,
Yin @SIGMOD’00)
16
Apriori: A Candidate Generation & Test Approach
q Outline of Apriori (level-wise, candidate generation and test)
q Initially, scan DB once to get frequent 1-itemset
q Repeat
q Generate length-(k+1) candidate itemsets from length-k frequent
itemsets
q Test the candidates against DB to find frequent (k+1)-itemsets
q Set k := k +1
q Until no frequent or candidate set can be generated
q Return all the frequent itemsets derived
17
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemsets of size k
Fk: Frequent itemsets of size k

k := 1;
Fk := {frequent items}; // frequent 1-itemsets
While (Fk ≠ ∅) do { // while Fk is non-empty
    Ck+1 := candidates generated from Fk; // candidate generation
    Derive Fk+1 by counting candidates in Ck+1 with respect to TDB at minsup;
    k := k + 1
}
return ∪k Fk // return the Fk generated at each level
18
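A minimal Python sketch of this loop, assuming transactions are given as sets of items. Candidate generation here extends each frequent k-itemset with a frequent item and keeps only extensions whose k-subsets are all frequent, which is equivalent to the self-join-plus-pruning step described on the following slides:

# Minimal Apriori sketch following the pseudo-code above.
# `db` is a list of item sets; returns {frozenset: absolute support}.
from itertools import combinations

def apriori(db, minsup):
    db = [frozenset(t) for t in db]
    # F1: frequent 1-itemsets
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {X: c for X, c in counts.items() if c >= minsup}
    freq, k = {}, 1
    while Fk:
        freq.update(Fk)
        # Generate length-(k+1) candidates whose k-subsets are all frequent (Apriori pruning)
        items = sorted({i for X in Fk for i in X})
        Ck1 = {X | {i} for X in Fk for i in items
               if i not in X and all(frozenset(s) in Fk for s in combinations(X | {i}, k))}
        # Scan DB to count the candidates; keep those reaching minsup
        Fk = {c: n for c in Ck1 if (n := sum(1 for t in db if c <= t)) >= minsup}
        k += 1
    return freq

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, minsup=2))   # includes frozenset({'B', 'C', 'E'}): 2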
The Apriori Algorithm—An Example (minsup = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 with counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
F1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from F1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan → C2 with counts: {A, B}:1, {A, C}:2, {A, E}:1, {B, C}:2, {B, E}:3, {C, E}:2
F2: {A, C}:2, {B, C}:2, {B, E}:3, {C, E}:2

C3 (generated from F2): {B, C, E}
3rd scan → F3: {B, C, E}:2
19
Apriori: Implementation Tricks
q How to generate candidates?
q Step 1: self-joining Fk
q Step 2: pruning
q Example of candidate generation
q F3 = {abc, abd, acd, ace, bcd}
q Self-joining: F3*F3
q abcd from abc and abd
q acde from acd and ace
q Pruning:
q acde is removed because ade is not in F3
q C4 = {abcd}
(Diagram: self-joining abc, abd, acd, ace, bcd yields abcd and acde; acde is pruned)
20
Candidate Generation: An SQL Implementation
q Suppose the items in Fk-1 are listed in an order
q Step 1: self-joining Fk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Fk-1 as p, Fk-1 as q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
q Step 2: pruning
    for all itemsets c in Ck do
        for all (k-1)-subsets s of c do
            if (s is not in Fk-1) then delete c from Ck
(Diagram: the same self-join/pruning example as on the previous slide)
21
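The same two steps, sketched in Python rather than SQL, with itemsets kept as lexicographically ordered tuples (the function name gen_candidates is an assumption); the example reproduces the F3 case from the previous slide:

# Minimal sketch of the two steps: a prefix self-join on ordered itemsets,
# then Apriori pruning. Mirrors the SQL statement above.
from itertools import combinations

def gen_candidates(Fk_minus_1):
    """Fk_minus_1: set of sorted tuples of items; returns the candidate k-itemsets."""
    Fset = set(Fk_minus_1)
    k_minus_1 = len(next(iter(Fk_minus_1)))
    Ck = set()
    # Step 1: self-join on the first k-2 items
    for p in Fk_minus_1:
        for q in Fk_minus_1:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(p + (q[-1],))
    # Step 2: prune candidates having an infrequent (k-1)-subset
    return {c for c in Ck
            if all(s in Fset for s in combinations(c, k_minus_1))}

F3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(gen_candidates(F3))  # {('a', 'b', 'c', 'd')} -- ('a','c','d','e') is pruned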
Apriori: Improvements and Alternatives
q Reduce passes of transaction database scans (discussed in subsequent slides)
q Partitioning (e.g., Savasere, et al., 1995)
q Dynamic itemset counting (Brin, et al., 1997)
q Shrink the number of candidates (discussed in subsequent slides)
q Hashing (e.g., DHP: Park, et al., 1995)
q Pruning by support lower bounding (e.g., Bayardo 1998)
q Sampling (e.g., Toivonen, 1996)
q Exploring special data structures
q Tree projection (Agarwal, et al., 2001)
q H-Mine (Pei, et al., 2001)
q Hypercube decomposition (e.g., LCM: Uno, et al., 2004)
22
Partitioning: Scan Database Only Twice
q Theorem: Any itemset that is potentially frequent in TDB must be frequent in at least one of the partitions of TDB
q Proof sketch: if an itemset were infrequent in every partition, its count in each partition would fall below minsup × |partition|, so its total count would fall below minsup × |TDB|
q Method: Scan 1: partition the database and find the local frequent itemsets of each partition; Scan 2: count the union of all local frequent itemsets against the whole DB to determine which are globally frequent
q Related hashing trick (DHP): hash the 2-itemsets of each transaction (e.g., {bd}, {be}, {de}) into buckets; a 2-itemset can be frequent only if its bucket count reaches minsup
24
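A hedged Python sketch of the two-scan idea (after Savasere et al.), reusing the apriori() sketch from the earlier slide as the in-memory miner for each partition; the function name and the choice of two partitions are assumptions:

# Sketch of partition-based mining: any globally frequent itemset must be
# locally frequent in at least one partition, so the union of the local
# results is a complete candidate set. Reuses apriori() defined above.
from math import ceil

def partition_mining(db, minsup_frac, num_parts=2):
    """Two database scans: mine each partition locally, then verify globally."""
    db = [frozenset(t) for t in db]
    n = len(db)
    size = ceil(n / num_parts)
    # Scan 1: mine each partition with the same *relative* minsup
    candidates = set()
    for i in range(0, n, size):
        part = db[i:i + size]
        candidates |= set(apriori(part, ceil(minsup_frac * len(part))))
    # Scan 2: count every candidate against the full DB and keep the truly frequent ones
    counts = {c: sum(1 for t in db if c <= t) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= minsup_frac * n}

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mining(db, minsup_frac=0.5))   # same result as plain Apriori at minsup = 2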
Exploring Vertical Data Format: ECLAT
q ECLAT (Equivalence Class Transformation): A depth-first search algorithm using set intersection [Zaki et al. @KDD'97]
q Tid-List: List of transaction-ids containing an itemset
q Vertical format: t(e) = {T10, T20, T30}; t(a) = {T10, T20}; t(ae) = {T10, T20}
q Properties of Tid-Lists
q t(X) = t(Y): X and Y always happen together (e.g., t(ac) = t(d))
q t(X) ⊂ t(Y): a transaction having X always has Y (e.g., t(ac) ⊂ t(ce))
q Deriving frequent patterns based on vertical intersections
q Using diffsets to accelerate mining
q Only keep track of differences of tids
q t(e) = {T10, T20, T30}, t(ce) = {T10, T30} → Diffset(ce, e) = {T20}

A transaction DB in Horizontal Data Format
Tid   Itemset
10    a, c, d, e
20    a, b, e
30    b, c, e

The same DB in Vertical Data Format
Item   TidList
a      10, 20
b      20, 30
c      10, 30
d      10
e      10, 20, 30
25
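A minimal Python sketch of the vertical representation, reproducing the tid-list and diffset examples above (the helper name tidlist is an assumption):

# Minimal sketch: tid-lists, itemset support by tid-list intersection, and a
# diffset, using the 3-transaction DB shown above.
db = {10: {"a", "c", "d", "e"}, 20: {"a", "b", "e"}, 30: {"b", "c", "e"}}

def tidlist(itemset, db):
    """Tid-list of an itemset: ids of the transactions containing all its items."""
    return {tid for tid, t in db.items() if set(itemset) <= t}

t_e  = tidlist({"e"}, db)       # {10, 20, 30}
t_ce = tidlist({"c", "e"}, db)  # {10, 30}
print(len(t_ce))                # support of {c, e} = |t(ce)| = 2
print(t_e - t_ce)               # diffset(ce, e) = {20}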
Why Mining Frequent Patterns by Pattern Growth?
q Apriori: A breadth-first search mining algorithm
q First find the complete set of frequent k-itemsets
q Then derive frequent (k+1)-itemset candidates
q Scan DB again to find true frequent (k+1)-itemsets
q Motivation for a different mining methodology
q Can we develop a depth-first search mining algorithm?
q For a frequent itemset ρ, can subsequent search be confined
to only those transactions that contain ρ?
q Such thinking leads to a frequent pattern growth approach:
q FPGrowth (J. Han, J. Pei, Y. Yin, “Mining Frequent Patterns
without Candidate Generation,” SIGMOD 2000)
26
Example: Construct FP-tree from a Transaction DB

TID   Items in the Transaction       Ordered, frequent itemlist
100   {f, a, c, d, g, i, m, p}       f, c, a, m, p
200   {a, b, c, f, l, m, o}          f, c, a, b, m
300   {b, f, h, j, o, w}             f, b
400   {b, c, k, s, p}                c, b, p
500   {a, f, c, e, l, p, m, n}       f, c, a, m, p

Let min_support = 3
1. Scan DB once, find the single-item frequent patterns: f:4, a:3, c:4, b:3, m:3, p:3
2. Sort frequent items in frequency-descending order, giving the f-list: F-list = f-c-a-b-m-p
3. Scan DB again and construct the FP-tree
q The frequent itemlist of each transaction is inserted as a branch, with shared sub-branches merged and counts accumulated

Header Table (item : frequency, each with a head of node-links): f:4, c:4, a:3, b:3, m:3, p:3

After inserting the 1st frequent itemlist “f, c, a, m, p”, the tree is the single path:
{} → f:1 → c:1 → a:1 → m:1 → p:1
27
Example: Construct FP-tree from a Transaction DB (continued)
(Same transaction DB, min_support = 3, F-list = f-c-a-b-m-p, and construction steps as on the previous slide)

After inserting the 2nd frequent itemlist “f, c, a, b, m”, the shared prefix f, c, a is merged and its counts accumulated:
{} → f:2 → c:2 → a:2, which then branches into m:1 → p:1 and b:1 → m:1
28
Example: Construct FP-tree from a Transaction DB (continued)
(Same transaction DB, min_support = 3, F-list = f-c-a-b-m-p, and construction steps as on the previous slide)

After inserting all the frequent itemlists, the complete FP-tree (header table: f:4, c:4, a:3, b:3, m:3, p:3) is:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
29
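A hedged Python sketch of the three construction steps above, not the original implementation: node fields and the header-table layout are assumptions, and ties among equally frequent items are broken alphabetically here, so the resulting f-list and tree may order tied items (e.g., f and c) differently from the slides' F-list f-c-a-b-m-p:

# Sketch: build an FP-tree and header table from a transaction DB.
from collections import defaultdict

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                      # item -> child FPNode

def build_fptree(db, min_support):
    # Step 1: scan DB once, count single items and keep the frequent ones
    freq = defaultdict(int)
    for t in db:
        for item in t:
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_support}
    # Step 2: the f-list in frequency-descending order (ties broken alphabetically)
    flist = sorted(freq, key=lambda i: (-freq[i], i))
    # Step 3: scan DB again; insert each ordered frequent itemlist as a branch,
    # merging shared prefixes, accumulating counts, and recording node links
    root, header = FPNode(), defaultdict(list)
    for t in db:
        node = root
        for item in (i for i in flist if i in t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header, flist = build_fptree(db, min_support=3)
print(flist)                      # ['c', 'f', 'a', 'b', 'm', 'p'] (f and c tie at 4; the slides order them f-c)
print(root.children["c"].count)   # 4: the top node of the main branch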
Mining FP-Tree: Divide and Conquer Based on Patterns and Data
q Pattern mining can be partitioned according to the current patterns (min_support = 3)
q Patterns containing p: mined from p’s conditional database: fcam:2, cb:1
q p’s conditional database (i.e., the database under the condition that p exists): the transformed prefix paths of item p in the FP-tree
q Patterns having m but no p: mined from m’s conditional database: fca:2, fcab:1
q …… and so on: one conditional database for each remaining item
32
FPGrowth: Mining Frequent Patterns by Pattern Growth
q Essence of frequent pattern growth (FPGrowth) methodology
q Find frequent single items and partition the database based on each
such single item pattern
q Recursively grow frequent patterns by doing the above for each
partitioned database (also called the pattern’s conditional database)
q To facilitate efficient processing, an efficient data structure, FP-tree, can
be constructed
q Mining becomes
q Recursively construct and mine (conditional) FP-trees
q Until the resulting FP-tree is empty, or until it contains only one path—
single path will generate all the combinations of its sub-paths, each of
which is a frequent pattern
33
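To illustrate the recursion itself, here is a hedged Python sketch that runs pattern growth directly on conditional databases represented as (ordered itemlist, count) pairs; the FP-tree of the earlier slides is a compressed structure for exactly this computation, and the function name pattern_growth is an assumption:

# Sketch of the pattern-growth recursion on explicit conditional databases.
from collections import defaultdict

def pattern_growth(cond_db, min_support, prefix=frozenset(), out=None):
    out = {} if out is None else out
    # Frequent single items within this conditional database
    counts = defaultdict(int)
    for itemlist, cnt in cond_db:
        for item in itemlist:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < min_support:
            continue
        pattern = prefix | {item}
        out[pattern] = cnt
        # Build item's conditional database: the prefix of item in each itemlist
        new_db = []
        for itemlist, c in cond_db:
            if item in itemlist:
                pre = itemlist[:itemlist.index(item)]
                if pre:
                    new_db.append((pre, c))
        pattern_growth(new_db, min_support, pattern, out)
    return out

# Ordered frequent itemlists of the 5-transaction example (F-list order f-c-a-b-m-p)
db = [(["f","c","a","m","p"], 1), (["f","c","a","b","m"], 1), (["f","b"], 1),
      (["c","b","p"], 1), (["f","c","a","m","p"], 1)]
result = pattern_growth(db, 3)
print(result[frozenset({"f", "c", "a", "m"})])  # 3
print(result[frozenset({"c", "p"})])            # 3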
Scaling FP-growth by Item-Based Data Projection
q What if FP-tree cannot fit in memory?—Do not construct FP-tree
q “Project” the database based on frequent single items
q Construct & mine FP-tree for each projected DB
q Parallel projection vs. partition projection
q Parallel projection: Project the DB on each frequent item
q Space costly, all partitions can be processed in parallel
q Partition projection: Partition the DB in order
q Passing the unprocessed parts to subsequent partitions
(Illustration: a transaction DB shown with its parallel projection and its partition projection)

Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
36
Pattern Evaluation
q Null-Invariant Measures
37
How to Judge if a Rule/Pattern Is Interesting?
q Pattern-mining will generate a large set of patterns/rules
q Not all the generated patterns/rules are interesting
q Interestingness measures: Objective vs. subjective
q Objective interestingness measures
q Support, confidence, correlation, …
q Subjective interestingness measures:
q Different users may judge interestingness differently
(Note: Jaccard, cosine, AllConf, MaxConf, and Kulczynski are null-invariant measures)
43
Null Invariance: An Important Property
q Why is null invariance crucial for the analysis of massive transaction data?
q Many transactions may contain neither milk nor coffee! Such transactions are null-transactions w.r.t. m(ilk) and c(offee)
q Lift and χ2 are not null-invariant: not good for evaluating data that contain too many or too few null transactions!
q Many measures are not null-invariant!
(Illustration: milk vs. coffee 2 × 2 contingency table)
44
Comparison of Null-Invariant Measures
q Not all null-invariant measures are created equal: which one is better?
q Datasets D4–D6 differentiate the five null-invariant measures (all five are null-invariant on the 2-variable contingency table)
q Kulc (Kulczynski 1927) holds firm and is in balance of both directional implications
45
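For reference, a minimal Python sketch computing Lift, the five null-invariant measures, and the Imbalance Ratio from the counts of a 2 × 2 contingency table; the tiny numeric example shows that adding null transactions (increasing n) changes Lift but leaves the null-invariant measures and IR untouched:

# Sketch: interestingness measures from contingency counts.
# n_ab = #transactions with both A and B, n_a / n_b = #with A / #with B, n = total.
from math import sqrt

def measures(n_ab, n_a, n_b, n):
    s_ab, s_a, s_b = n_ab / n, n_a / n, n_b / n
    return {
        "lift":     s_ab / (s_a * s_b),            # NOT null-invariant: depends on n
        "all_conf": s_ab / max(s_a, s_b),
        "max_conf": max(s_ab / s_a, s_ab / s_b),
        "jaccard":  s_ab / (s_a + s_b - s_ab),
        "cosine":   s_ab / sqrt(s_a * s_b),
        "kulc":     0.5 * (s_ab / s_a + s_ab / s_b),
        "IR":       abs(s_a - s_b) / (s_a + s_b - s_ab),
    }

# Adding null transactions (larger n) changes only lift; the other values stay put.
print(measures(100, 1000, 1000, 10_000)["lift"])    # 1.0
print(measures(100, 1000, 1000, 100_000)["lift"])   # 10.0
print(measures(100, 1000, 1000, 100_000)["kulc"])   # 0.1 in both cases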
Analysis of DBLP Coauthor Relationships
q DBLP: Computer science research publication bibliographic database
q > 3.8 million entries on authors, papers, venues, years, and other information
q Kulczynski and Imbalance Ratio (IR) together present a clear picture for all three datasets D4 through D6
q D4 is neutral & balanced; D5 is neutral but imbalanced
q D6 is neutral but very imbalanced
47
What Measures to Choose for Effective Pattern Evaluation?
q Null value cases are predominant in many large datasets
q Neither milk nor coffee is in most of the baskets; neither Mike nor Jim is an author
in most of the papers; ……
q Null-invariance is an important property
q Lift, χ2 and cosine are good measures if null transactions are not predominant
q Otherwise, Kulczynski + Imbalance Ratio should be used to judge the
interestingness of a pattern
q Exercise: Mining research collaborations from research bibliographic data
q Find a group of frequent collaborators from research bibliographic data (e.g., DBLP)
q Can you find the likely advisor-advisee relationship and during which years such a
relationship happened?
q Ref.: C. Wang, J. Han, Y. Jia, J. Tang, D. Zhang, Y. Yu, and J. Guo, "Mining Advisor-
Advisee Relationships from Research Publication Networks", KDD'10
48
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods
q Basic Concepts
q Efficient Pattern Mining Methods
q Pattern Evaluation
q Summary
49
Summary
q Basic Concepts
q What Is Pattern Discovery? Why Is It Important?
q Basic Concepts: Frequent Patterns and Association Rules
q Compressed Representation: Closed Patterns and Max-Patterns
q Efficient Pattern Mining Methods
q The Downward Closure Property of Frequent Patterns
q The Apriori Algorithm
q Extensions or Improvements of Apriori
q Mining Frequent Patterns by Exploring Vertical Data Format
q FPGrowth: A Frequent Pattern-Growth Approach
q Mining Closed Patterns
q Pattern Evaluation
q Interestingness Measures in Pattern Mining
q Interestingness Measures: Lift and χ2
q Null-Invariant Measures
q Comparison of Interestingness Measures
50
Recommended Readings (Basic Concepts)
q R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of
items in large databases”, in Proc. of SIGMOD'93
q R. J. Bayardo, “Efficiently mining long patterns from databases”, in Proc. of
SIGMOD'98
q N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets
for association rules”, in Proc. of ICDT'99
q J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent Pattern Mining: Current Status and
Future Directions”, Data Mining and Knowledge Discovery, 15(1): 55-86, 2007
51
Recommended Readings (Efficient Pattern Mining Methods)
q R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, VLDB'94
q A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large
databases”, VLDB'95
q J. S. Park, M. S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules”,
SIGMOD'95
q S. Sarawagi, S. Thomas, and R. Agrawal, “Integrating association rule mining with relational database
systems: Alternatives and implications”, SIGMOD'98
q M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “Parallel algorithms for discovery of association
rules”, Data Mining and Knowledge Discovery, 1997
q J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation”, SIGMOD’00
q M. J. Zaki and C.-J. Hsiao, “CHARM: An Efficient Algorithm for Closed Itemset Mining”, SDM'02
q J. Wang, J. Han, and J. Pei, “CLOSET+: Searching for the Best Strategies for Mining Frequent Closed
Itemsets”, KDD'03
q C. C. Aggarwal, M. A. Bhuiyan, and M. A. Hasan, “Frequent Pattern Mining Algorithms: A Survey”, in
Aggarwal and Han (eds.): Frequent Pattern Mining, Springer, 2014
52
Recommended Readings (Pattern Evaluation)
q C. C. Aggarwal and P. S. Yu. A New Framework for Itemset Generation. PODS’98
q S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing
association rules to correlations. SIGMOD'97
q M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding
interesting rules from large sets of discovered association rules. CIKM'94
q E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE’03
q P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for
Association Patterns. KDD'02
q T. Wu, Y. Chen and J. Han, Re-Examination of Interestingness Measures in Pattern
Mining: A Unified Framework, Data Mining and Knowledge Discovery, 21(3):371-397,
2010
53