Jalali@mshdiua - Ac.ir Jalali - Mshdiau.ac - Ir: Data Mining
jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir
Data Mining
Association Rules
Data Mining
Exercises
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Motivation: finding inherent regularities in data.
Which products are often purchased together? Cheese and chips?!
Applications
Basket data analysis, cross-marketing, catalog design, Web log (click stream) analysis, and DNA
sequence analysis.
Transaction-id    Items bought
10                A, B, D
20                A, C, D
30                A, D, E
40                B, E, F
50                B, C, D, E, F

Itemset X = {x1, ..., xk}
Find all the rules X ==> Y with minimum support and confidence:
    support, s: probability that a transaction contains X ∪ Y
    confidence, c: conditional probability that a transaction having X also contains Y

Let supmin = 50%, confmin = 50%
Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}

[Figure label: Customer buys cheese]
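The support and confidence definitions can be checked directly on the five transactions listed above. The following Python lines are a minimal sketch for illustration; the helper names `support` and `confidence` are assumptions introduced here and do not come from the slides.

```python
# Transaction database from the example (transaction id -> items bought).
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset, db=transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(x, y, db=transactions):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(set(x) | set(y), db) / support(x, db)

print("support(A, D)     =", support({"A", "D"}))
print("confidence(A->D)  =", confidence({"A"}, {"D"}))
print("confidence(D->A)  =", confidence({"D"}, {"A"}))
```

With supmin = 50% and confmin = 50%, both A ==> D and D ==> A pass: {A, D} occurs in 3 of 5 transactions, A in 3, and D in 4.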
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
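As an illustration of the closed-itemset definition, the sketch below enumerates the frequent itemsets of the five-transaction example by brute force and keeps those with no proper superset of equal support. It assumes supmin = 50% translates to a count of at least 3 out of 5; the brute-force enumeration is chosen for brevity and is only reasonable for a toy database.

```python
from itertools import combinations

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
MIN_COUNT = 3  # assumption: 50% of 5 transactions, rounded up

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))

# Enumerate every non-empty itemset and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        c = count(frozenset(combo))
        if c >= MIN_COUNT:
            frequent[frozenset(combo)] = c

# X is closed if no proper superset Y has the same support.
# A superset with equal support is itself frequent, so checking
# the frequent itemsets is enough.
closed = {
    x: c for x, c in frequent.items()
    if not any(x < y and c == cy for y, cy in frequent.items())
}

for x, c in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(x), c)
```

Here {A} is frequent but not closed, because {A, D} has the same support count of 3.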
C1 (1st scan)
Itemset      sup
{A}          2
{B}          3
{C}          3
{D}          1
{E}          3

L1
Itemset      sup
{A}          2
{B}          3
{C}          3
{E}          3

C2 (2nd scan)
Itemset      sup
{A, B}       1
{A, C}       2
{A, E}       1
{B, C}       2
{B, E}       3
{C, E}       2

L2
Itemset      sup
{A, C}       2
{B, C}       2
{B, E}       3
{C, E}       2

C3 (3rd scan)
Itemset
{B, C, E}

L3
Itemset      sup
{B, C, E}    2
The final frequent item sets are those remaining in L2 and L3 (items are numbered from here on: 1 = A, 2 = B, 3 = C, 5 = E). However, {2,3}, {2,5}, and {3,5} are all contained in the larger item set {2, 3, 5}. Thus, the final group of item sets reported by Apriori is {1,3} and {2,3,5}. These are the only item sets from which we will generate association rules.
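The filtering step described above, keeping only item sets that are not contained in a larger frequent item set, can be sketched in a few lines of Python. The variable names are illustrative assumptions; the item sets are the ones remaining in L2 and L3.

```python
# Frequent item sets remaining in L2 and L3 (numeric item labels).
frequent = [frozenset(s) for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5}, {2, 3, 5})]

# Keep an item set only if it is not a proper subset of another frequent one.
maximal = [s for s in frequent if not any(s < other for other in frequent)]

for s in maximal:
    print(sorted(s))
```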
For each frequent itemset f, generate all non-empty proper subsets of f
For every non-empty proper subset s of f do
    if support(f) / support(s) >= min_confidence then
        output rule s ==> (f - s)
end
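The rule-generation procedure above can be sketched in Python as follows. The function name `generate_rules` and the dictionary layout are assumptions made for this illustration. The support counts are those of the worked example in this section, written with numeric item labels (1 = A, 2 = B, 3 = C, 5 = E), and every frequent itemset is used as a rule source.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """supports: dict mapping frozenset -> support count of each frequent itemset.

    Returns a list of (antecedent, consequent, confidence) tuples.
    """
    rules = []
    for f, sup_f in supports.items():
        # Only non-empty proper subsets, so the consequent is never empty.
        for size in range(1, len(f)):
            for subset in combinations(sorted(f), size):
                s = frozenset(subset)
                conf = sup_f / supports[s]
                if conf >= min_confidence:
                    rules.append((s, f - s, conf))
    return rules

# Support counts taken from the L1, L2, and L3 tables of the example.
supports = {frozenset(k): v for k, v in [
    ((1,), 2), ((2,), 3), ((3,), 3), ((5,), 3),
    ((1, 3), 2), ((2, 3), 2), ((2, 5), 3), ((3, 5), 2),
    ((2, 3, 5), 2),
]}

rules = generate_rules(supports, min_confidence=0.75)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "==>", sorted(rhs), f"{conf:.2f}")
```

Every subset of a frequent itemset is itself frequent, so the lookup `supports[s]` always succeeds when `supports` holds all frequent itemsets.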
Rule                Conf.
{2} ==> {5}         3/3 = 1.00
{3} ==> {1}         2/3 = 0.67
{2,5} ==> {3}       2/3 = 0.67
{3,5} ==> {2}       2/2 = 1.00
{2} ==> {3,5}       2/3 = 0.67
{3} ==> {2,5}       2/3 = 0.67
{5} ==> {2,3}       2/3 = 0.67
{2} ==> {3}         2/3 = 0.67
{3} ==> {2}         2/3 = 0.67
{3} ==> {5}         2/3 = 0.67
{5} ==> {2}         3/3 = 1.00
{5} ==> {3}         2/3 = 0.67
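Assuming a minimum confidence of 75%, the final set of rules reported by Apriori is: {1} ==> {3}, {3,5} ==> {2}, {5} ==> {2}, and {2} ==> {5}. By the same support counts, {2,3} ==> {5} (2/2 = 1.00) also meets the threshold.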
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return the union of all Lk;
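The level-wise loop above can be sketched in Python as follows. This is an illustrative version under stated assumptions: the function name `apriori` is introduced here, the minimum support is given as an absolute count, and candidates are formed simply as unions of two frequent k-itemsets of size k+1, without the subset-pruning step. It is run on the five-transaction example database with a minimum count of 3 (50% of 5, rounded up).

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}

    frequent = dict(current)
    k = 1
    while current:
        # C(k+1): unions of two frequent k-itemsets that have exactly k+1 items.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        # One database scan: count each candidate contained in a transaction.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # L(k+1): candidates that reach the minimum support count.
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
frequent = apriori(transactions, min_count=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this database the result agrees with the frequent patterns listed earlier: A:3, B:3, D:4, E:3, AD:3.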
Candidate generation example:
    abcd from abc and abd
    acde from acd and ace
Pruning:
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
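The join and prune steps can be sketched together in Python. The function name `apriori_gen` and the tuple representation are assumptions for this illustration. The slides name only abc, abd, acd, and ace; the itemset bcd is added here as a hypothetical member of L3 so that abcd passes the pruning check.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build C_k from L_(k-1), given as a set of sorted tuples of equal length."""
    prev = sorted(prev_frequent)
    if not prev:
        return set()
    size = len(prev[0])  # this is k-1

    # Step 1: self-join. Two itemsets join if they agree on all but the last item.
    joined = {a + (b[-1],) for a in prev for b in prev
              if a[:-1] == b[:-1] and a[-1] < b[-1]}

    # Step 2: pruning. Drop c if any (k-1)-subset of c is not in L_(k-1).
    return {c for c in joined
            if all(s in prev_frequent for s in combinations(c, size))}

# Assumed L3: abc, abd, acd, ace from the example plus the hypothetical bcd.
L3 = {tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")}
C4 = apriori_gen(L3)
print(sorted("".join(c) for c in C4))
```

Under this assumed L3, the join yields abcd and acde, and acde is pruned because its subset ade is not in L3.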
1. Scan DB once, find the frequent items
2. Sort frequent items in frequency descending order
3. Scan DB again, construct FP-tree
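The construction steps above can be sketched in Python as follows. The class `FPNode`, the function `build_fp_tree`, and the header table layout are assumptions for this illustration, and ties in frequency are broken alphabetically, which the slides do not specify. The five-transaction example database is used with a minimum count of 3.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and links to parent/children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # Sort frequent items by descending frequency (ties alphabetical: an assumption).
    ordered = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered)}

    # Scan 2: insert each transaction, restricted to frequent items, in that order.
    root = FPNode(None)
    header = {item: [] for item in ordered}  # item -> list of its nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
root, header = build_fp_tree(transactions, min_count=3)
for item, nodes in header.items():
    print(item, [n.count for n in nodes])
```

Transactions that share a prefix in the sorted order share tree nodes, which is where the compression comes from.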
Compactness
Reduce irrelevant info: infrequent items are gone
Items in frequency descending order: the more frequently occurring, the more likely to be shared
Never larger than the original database (not counting node-links and the count field)
Method
For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
Repeat the process on each newly created conditional FP-tree
Until the resulting FP-tree is empty, or it contains only one path: a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
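The recursive method above can be sketched in Python. This is a simplified pattern-growth version, not a full FP-tree implementation: each conditional pattern base is kept as a plain list of (path, count) pairs instead of a linked tree, and the single-path shortcut is omitted. The function name `fp_growth` and the alphabetical tie-breaking are assumptions for this illustration.

```python
from collections import defaultdict

def fp_growth(pattern_base, min_count, suffix=()):
    """pattern_base: list of (items, count). Returns {frozenset: support count}."""
    # Count each item in this (conditional) pattern base.
    counts = defaultdict(int)
    for items, n in pattern_base:
        for item in items:
            counts[item] += n
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # Order frequent items by descending frequency (ties alphabetical: an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Drop infrequent items and sort every path in that order.
    paths = [(tuple(sorted((i for i in items if i in rank), key=rank.get)), n)
             for items, n in pattern_base]

    results = {}
    for item in reversed(order):  # start from the least frequent item
        new_suffix = (item,) + suffix
        results[frozenset(new_suffix)] = frequent[item]
        # Conditional pattern base: the prefix before `item` in each path.
        conditional = [(p[:p.index(item)], n) for p, n in paths if item in p]
        results.update(fp_growth(conditional, min_count, new_suffix))
    return results

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
patterns = fp_growth([(t, 1) for t in transactions], min_count=3)
for itemset, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The recursion stops when a conditional pattern base has no frequent item left, which corresponds to the empty conditional FP-tree in the method above.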
[Figure: run time (sec.) plot]
Different minimum support thresholds across multiple levels lead to different algorithms (e.g., decreasing min-support at lower levels)
Handling quantitative rules may require mapping the continuous variables into Boolean ones
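One common way to do such a mapping is to cut each continuous attribute into intervals and treat "value falls in interval" as a Boolean item. The sketch below is only an illustration: the attribute names `age` and `income`, the interval boundaries, and the function `to_boolean_items` are all hypothetical and do not come from the slides.

```python
def to_boolean_items(record, bins):
    """Map numeric attributes to Boolean items such as 'age=30-39'.

    bins: dict mapping attribute -> list of (low, high) half-open intervals.
    """
    items = set()
    for attr, value in record.items():
        for low, high in bins[attr]:
            if low <= value < high:
                items.add(f"{attr}={low}-{high - 1}")
                break  # intervals are assumed not to overlap
    return items

# Hypothetical intervals and record, for illustration only.
bins = {
    "age": [(20, 30), (30, 40), (40, 50)],
    "income": [(0, 50), (50, 100)],
}
record = {"age": 34, "income": 48}
items = to_boolean_items(record, bins)
print(sorted(items))
```

The resulting items can be fed to any itemset miner in the same way as ordinary basket items.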
Term Associations
Find associations among words based on their occurrences in documents
Similar to above, but invert the table (terms as items, and docs as transactions)

         business   capital   fund   ...   invest
Doc 1    5          2         0      ...   6
Doc 2    5          4         0      ...   0
Doc 3    2          3         0      ...   0
...      ...        ...       ...    ...   ...
Doc n    1          5         1      ...   3
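Treating terms as items and documents as transactions can be sketched as follows. The code uses only the four rows and four columns visible in the table; the elided rows and columns are left out. Treating a term as present when its count is above zero is an assumption for this illustration.

```python
from collections import Counter
from itertools import combinations

# Visible part of the document-term table (elided rows and columns omitted).
doc_term = {
    "Doc 1": {"business": 5, "capital": 2, "fund": 0, "invest": 6},
    "Doc 2": {"business": 5, "capital": 4, "fund": 0, "invest": 0},
    "Doc 3": {"business": 2, "capital": 3, "fund": 0, "invest": 0},
    "Doc n": {"business": 1, "capital": 5, "fund": 1, "invest": 3},
}

# Each document becomes a transaction holding the terms it contains.
transactions = {
    doc: {term for term, n in row.items() if n > 0}
    for doc, row in doc_term.items()
}

# Count how many documents contain each pair of terms.
pair_counts = Counter(
    pair
    for terms in transactions.values()
    for pair in combinations(sorted(terms), 2)
)

for pair, n in pair_counts.most_common():
    print(pair, n)
```

The pair counts play the role of 2-itemset supports, so the usual support and confidence thresholds apply to term pairs as well.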
Examples
60% of clients who accessed /products/ also accessed /products/software/webminer.htm
30% of clients who accessed /special-offer.html placed an online order in /products/software/
Actual example from the IBM official Olympics site:
Association Rule
90
97.2
3.17   /PUBLIC/product-info/T3E ===> /PUBLIC/product-info/T3E/CRAY_T3E.html
0.14   /PUBLIC/product-info/J90/J90.html, /PUBLIC/product-info/T3E ===> /PUBLIC/product-info/T3E/CRAY_T3E.html
0.15   /PUBLIC/product-info/J90, /PUBLIC/product-info/T3E/CRAY_T3E.html, /PUBLIC/product-info/T90 ===> /PUBLIC/product-info/T3E, /PUBLIC/sc.html
Design suggestions