
DATA MINING

Association Rules
Lec 7

Mohammed
Taiz University
Outlines
• Basic Concepts
• Frequent Itemset Mining Methods
What Is Frequent Pattern Analysis?
• Frequent pattern
• a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and
association rule mining
• Finding frequent patterns plays an essential role in mining associations, correlations, and many
other interesting relationships among data.

• Motivation: Finding inherent regularities in data


• What products were often purchased together? — Beer and diapers?!

• What are the subsequent purchases after buying a PC?

• What kinds of DNA are sensitive to this new drug?

• Can we automatically classify web documents?

• Applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click
stream) analysis, and DNA sequence analysis
Market Basket Analysis
Basic Concepts: Transactional Data
• Market basket example:
• Basket 1:{}
• Basket 2:{}
• Basket 3:{}
• ….
• Basket n:{}

• Definitions:
• An item: an article in a basket, or an attribute-value pair.
• A transaction: the set of items purchased in a basket.
• A transactional dataset: A set of transactions.
Basic Concepts: Frequent Patterns

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

• Itemset
• A set of one or more items
• e.g., {A, B, D} is an itemset.
• A k-itemset X = {x1, …, xk} is an itemset with k items.
• (absolute) support, or support count, of X
• The frequency (number of occurrences) of itemset X
• e.g., sup({A}) = 3
• (relative) support, s: the fraction of transactions that contain X
• i.e., the probability that a transaction contains X
• e.g., s({A}) = 3/5 = 60%
• An itemset X is frequent if X's support is no less than a minsup threshold.
Basic Concepts: Association Rules
• Find all the rules X  Y with minimum Transaction-id Items bought
support and confidence 10 A, B, D
• support, s , probability that a transaction
contains X  Y 20 A, C, D
• confidence, c, conditional probability 30 A, D, E
that a transaction having X also contains
Y 40 B, E, F
50 B, C, D, E, F
• Let sup = 50%, conf = 50%
m in m in

Custome Customer
• Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3 }
r buys A
• Association rules: buys
• A  D (60%, 100%) both
• 60% of all transactions show that A and
D are purchased together
• 100% of the costumers who purchased
Customer
A also bought D
buys D
• D  A (60%, 75%)
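As a check on these numbers, the support and confidence of A ⇒ D and D ⇒ A can be computed directly from the five transactions above. Below is a minimal Python sketch (the function names support and confidence are illustrative, not from the lecture):

# The five example transactions from the slide above.
transactions = [
    {"A", "B", "D"},
    {"A", "C", "D"},
    {"A", "D", "E"},
    {"B", "E", "F"},
    {"B", "C", "D", "E", "F"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "D"}, transactions))       # 0.6  -> A => D has support 60%
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> A => D has confidence 100%
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> D => A has confidence 75%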
Basic Concepts: Association Rules
• An association rule describes a relationship between two disjoint itemsets X and Y.
• It expresses the pattern: when X occurs, Y also occurs.

• Association rules do not represent any sort of causality or correlation between the two itemsets.
• X ⇒ Y does not mean X causes Y, so there is no causality.
• X ⇒ Y can be different from Y ⇒ X, unlike correlation.

• Association rules assist in marketing, targeted advertising, etc.
Basic Concepts: Support and Confidence
• Find all the rules X ⇒ Y with minimum support and confidence.
• support, s: probability that a transaction contains X ∪ Y
  s(X ⇒ Y) = P(X ∪ Y)
• confidence, c: conditional probability that a transaction containing X also contains Y
  c(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X) = count(X ∪ Y) / count(X)

• A rule is strong if its support and confidence are no less than the minimal support and confidence thresholds.

• A strong rule is not necessarily interesting (a proper choice of thresholds is necessary).
Association rule mining
• Two-step approach:
1. Frequent itemset generation
   – Generate all itemsets whose support ≥ minsup
2. Rule generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset (see the sketch after this list)

• Because the second step is far less costly than the first, the overall performance is determined by the first step.

• Frequent itemset generation is still computationally expensive.
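To make step 2 concrete, here is a minimal sketch of rule generation, assuming the frequent itemsets and their support counts have already been computed (the names gen_rules and frequent are illustrative, not from the lecture):

from itertools import combinations

def gen_rules(frequent, min_conf):
    """Emit every rule lhs => rhs that binary-partitions a frequent itemset
    and meets min_conf. `frequent` maps frozenset -> support count."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[lhs]  # count(X ∪ Y) / count(X)
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf

# Frequent patterns from the 5-transaction example (min_sup = 50%).
frequent = {
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset("AD"): 3,
}
for lhs, rhs, conf in gen_rules(frequent, 0.5):
    print(set(lhs), "=>", set(rhs), f"({conf:.0%})")
# {'A'} => {'D'} (100%) and {'D'} => {'A'} (75%)

Note that every proper subset of a frequent itemset is itself frequent (the Apriori property), so the lookup frequent[lhs] is always defined.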
Mining Frequent Itemsets task
• Input: A set of transactions T, over a set of items I
• Output: All itemsets with items in I having
• support ≥ minsup threshold

• Problem parameters:
• N = |T|: number of transactions
• d = |I|: number of (distinct) items
• w: max width of a transaction
• Number of possible itemsets? 2^d (see the lattice on the next slide)

• Scale of the problem:


• WalMart sells 100,000 items and can store billions of baskets.
• The Web has billions of words and many billions of pages.
The itemset lattice

Given d items, there are 2^d possible itemsets.
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns
• e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!

• Solution
• Mine closed patterns and max-patterns instead
Closed Frequent Itemset
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X.
• Equivalently, an itemset is closed if none of its immediate supersets has the same support as the itemset.
• Closed patterns are a lossless compression of frequent patterns:
• they reduce the number of patterns and rules without losing support information.
TID | Items
1   | {A, B}
2   | {B, C, D}
3   | {A, B, C, D}
4   | {A, B, D}
5   | {A, B, C, D}

Itemset      | Support
{A}          | 4
{B}          | 5
{C}          | 3
{D}          | 4
{A, B}       | 4
{A, C}       | 2
{A, D}       | 3
{B, C}       | 3
{B, D}       | 4
{C, D}       | 3
{A, B, C}    | 2
{A, B, D}    | 3
{A, C, D}    | 2
{B, C, D}    | 3
{A, B, C, D} | 2
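With the supports laid out as above, the closedness test is a one-liner. A minimal sketch, assuming the supports are stored in a dict keyed by frozenset (the name is_closed is mine):

supports = {
    frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3, frozenset("D"): 4,
    frozenset("AB"): 4, frozenset("AC"): 2, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 4, frozenset("CD"): 3,
    frozenset("ABC"): 2, frozenset("ABD"): 3, frozenset("ACD"): 2,
    frozenset("BCD"): 3, frozenset("ABCD"): 2,
}
items = frozenset("ABCD")

def is_closed(x):
    """Closed iff no immediate superset of x has the same support as x."""
    return all(supports.get(x | {e}, 0) < supports[x] for e in items - x)

closed = [set(x) for x in supports if is_closed(x)]
# -> {B}, {A,B}, {B,D}, {A,B,D}, {B,C,D}, {A,B,C,D}: only 6 of the 15 itemsets
#    need to be stored, yet every support in the table can be recovered from them.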
Maximal Frequent Itemset
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X.
• An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice with the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border, and everything beyond it is infrequent]
Maximal vs Closed Itemsets
• Every maximal frequent itemset is also closed, so: maximal frequent ⊆ closed frequent ⊆ frequent.
• Closed itemsets preserve the exact support of every frequent itemset; maximal itemsets record only the frequent/infrequent border.
Frequent Itemset Mining Methods
• Scalable mining methods: Three major approaches
• Apriori (Agrawal & Srikant @VLDB’94)
• Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
• Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation-and-Test
Approach
• Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated or tested!
(Agrawal & Srikant @VLDB’94; Mannila et al. @KDD’94)
• Method:
• Initially, scan DB once to get frequent 1-itemset
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Test the candidates against DB
• Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm—An Example

min_sup = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

C1 (after 1st scan):
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{D}     | 1
{E}     | 3

L1 (candidates meeting min_sup):
Itemset | sup
{A}     | 2
{B}     | 3
{C}     | 3
{E}     | 3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
Itemset | sup
{A, B}  | 1
{A, C}  | 2
{A, E}  | 1
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

L2:
Itemset | sup
{A, C}  | 2
{B, C}  | 2
{B, E}  | 3
{C, E}  | 2

C3 (generated from L2): {B, C, E}

L3 (after 3rd scan):
Itemset   | sup
{B, C, E} | 2
The Apriori Algorithm
• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent 1-itemsets};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
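The pseudo-code above translates almost line for line into Python. Below is a minimal runnable sketch (the function names are mine; the join step is simplified to "union of two k-itemsets that agree on k−1 items", without the subset-pruning refinement shown on the next slide):

def count_and_filter(candidates, transactions, min_sup):
    """One DB scan: count each candidate, keep those with support >= min_sup."""
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_sup}

def apriori(transactions, min_sup):
    """Return all frequent itemsets as {frozenset: support count}."""
    items = {i for t in transactions for i in t}
    Lk = count_and_filter([frozenset([i]) for i in items], transactions, min_sup)
    frequent, k = dict(Lk), 1
    while Lk:
        # Join step: merge two frequent k-itemsets sharing k-1 items.
        Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        Lk = count_and_filter(Ck1, transactions, min_sup)
        frequent.update(Lk)
        k += 1
    return frequent

# The 4-transaction database TDB from the example above, min_sup = 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in apriori(tdb, 2).items():
    print(set(itemset), sup)   # ends with {'B', 'C', 'E'} 2, matching L3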
Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• How to count supports of candidates?
• Example of candidate generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
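The same join-and-prune steps can be sketched as follows, run here on the L3 example above. Itemsets are kept as sorted tuples so that the join condition "the first k−1 items agree" is easy to state (the name gen_candidates is illustrative):

from itertools import combinations

def gen_candidates(Lk):
    """Self-join Lk with itself, then prune using the Apriori property."""
    Lk = set(Lk)
    # Join step: two k-itemsets sharing their first k-1 items yield
    # one (k+1)-candidate.
    joined = {a + (b[-1],)
              for a in Lk for b in Lk
              if a[:-1] == b[:-1] and a[-1] < b[-1]}
    # Prune step: every k-subset of a candidate must itself be in Lk.
    return {c for c in joined
            if all(s in Lk for s in combinations(c, len(c) - 1))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(gen_candidates(L3))
# {('a', 'b', 'c', 'd')} -- acde was pruned because ade is not in L3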
Improving the Efficiency of Apriori
• Bottlenecks of the Apriori approach:
• Candidate generation and test:
• often generates a huge number of candidates;
• it is costly to repeatedly scan the whole database.

• Improving Apriori: general ideas
• Reduce the number of passes over the transaction database
• Shrink the number of candidates (e.g., by sampling)
• Facilitate support counting of candidates (e.g., with hash tables)


Which Patterns Are Interesting?
• Strong rules are not necessarily interesting
• Analyzing transactions at AllElectronics:
• purchases of computer games and videos
• 10,000 transactions
• 6,000 include computer games
• 7,500 include videos
• 4,000 include both

• min support = 30%, min confidence = 60%

• buys(X, "computer games") ⇒ buys(X, "videos") [support = 40%, confidence = 66%]

• Is it interesting?
• No: 75% of all transactions already contain videos, so a 66% confidence means buying games actually lowers the likelihood of buying videos.
• lift = P(games ∪ videos) / (P(games) × P(videos)) = 0.40 / (0.60 × 0.75) = 0.89 < 1
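A quick numeric check of this example, using lift as the correlation measure (the helper name lift is mine):

def lift(p_xy, p_x, p_y):
    """lift = P(X ∪ Y) / (P(X) * P(Y)); a value below 1 means negative correlation."""
    return p_xy / (p_x * p_y)

print(lift(4000 / 10000, 6000 / 10000, 7500 / 10000))  # 0.888... < 1
# Games and videos are negatively correlated, despite the 66% confidence.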
ANY QUESTIONS
