Data Mining - Module 6
Province of Cotabato
Municipality of Makilala
MAKILALA INSTITUTE OF SCIENCE AND TECHNOLOGY
Makilala, Cotabato
III. REFERENCES
Main Textbook
Tan, Steinbach, Karpatne, & Kumar (2019). Introduction to Data Mining. 2nd Edition.
Han, J., Kamber, M., & Pei, J. (2013). Data Mining: Concepts and Techniques. 3rd Edition.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical Machine Learning Tools and Techniques. 4th Edition.
The Apriori principle holds because of the following property of the support measure:
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
- Support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support
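The property can be checked numerically. The following is a minimal sketch, assuming the six transactions T1 to T6 from the worked example in this module; the helper name `support_count` is hypothetical and introduced only for illustration.

```python
# Transactions T1..T6 copied from the worked example in this module.
transactions = [
    {"I1", "I2", "I3"},
    {"I2", "I3", "I4"},
    {"I4", "I5"},
    {"I1", "I2", "I4"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3", "I4"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

x = {"I1", "I2"}        # X
y = {"I1", "I2", "I3"}  # Y, a superset of X

print(support_count(x, transactions))  # 4
print(support_count(y, transactions))  # 3
```

Because every transaction that contains Y also contains X, the count for X can never be smaller than the count for Y.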
Flowchart (final steps of the Apriori algorithm):
Decision: The candidate set = Null? (NO / YES)
STEP 5: For each frequent itemset l, generate all nonempty subsets of l.
STEP 6: For every nonempty subset s of l, output the rule "s => (l - s)" if the confidence C of the rule "s => (l - s)" (= support of l / support of s) ≥ min_conf.
Lesson 3: Market Basket Analysis
Market basket analysis provides insight into which products tend to be purchased together and which are most amenable to promotion. The resulting rules may be:
Actionable rules
Trivial rules
- People who buy chalk-piece also buy duster
Inexplicable rules
- People who buy mobile also buy bag
1. Let k = 1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent itemsets are identified:
   3.1 Generate length (k+1) candidate itemsets from length k frequent itemsets
   3.2 Prune candidate itemsets containing subsets of length k that are infrequent
       - How many k-itemsets are contained in a (k+1)-itemset? (k+1 of them)
   3.3 Count the support of each candidate by scanning the DB
   3.4 Eliminate candidates that are infrequent, leaving only those that are frequent
   3.5 Set k = k + 1
Note: steps 3.2 and 3.4 prune itemsets that are infrequent
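The steps above can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name `apriori` is chosen here, candidates are formed by taking unions of frequent k-itemsets (a simple stand-in for the merge rule), and `min_sup` is assumed to be an absolute count. The transactions are T1 to T6 from the worked example in this module.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Steps 1-2: k = 1, find the frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        c = frozenset([item])
        n = sum(1 for t in transactions if c <= t)
        if n >= min_sup:
            current[c] = n
    frequent = dict(current)
    k = 1
    # Step 3: repeat until no new frequent itemsets are identified.
    while current:
        prev = set(current)
        # 3.1: length (k+1) candidates as unions of frequent k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # 3.2: prune candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # 3.3: count the support of each candidate by scanning the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # 3.4: keep only the frequent candidates.
        current = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# Transactions T1..T6 from the worked example, with min_sup = 3.
transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]
result = apriori(transactions, min_sup=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
# last line printed: ['I1', 'I2', 'I3'] 3
```

Each pass of the `while` loop rescans all transactions once, which is the database scan mentioned in step 3.3.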
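Candidate generation by merging frequent itemsets:
1. Keep the items of every itemset in lexicographic order.
2. Merge (x1, x2, …, x(k-1)) with (y1, y2, …, y(k-1)) if x1 = y1, x2 = y2, …, x(k-2) = y(k-2), that is, if the two itemsets agree on their first k-2 items.
E.g. Merge {Bread, Diaper} with {Bread, Milk} to get {Bread, Diaper, Milk}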
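The merge rule can be written as a short function. This is a minimal sketch; the name `merge_candidates` is hypothetical, and itemsets are assumed to be given as tuples or lists of item names, all of the same length.

```python
def merge_candidates(frequent_itemsets):
    """Merge pairs of equal-length itemsets that share all but their last item."""
    # Sort the items inside each itemset, then sort the itemsets themselves,
    # so that everything is in lexicographic order.
    ordered = sorted(tuple(sorted(s)) for s in frequent_itemsets)
    merged = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:-1] == b[:-1]:  # same prefix, different last item
                merged.append(a[:-1] + (a[-1], b[-1]))
    return merged

print(merge_candidates([("Bread", "Milk"), ("Bread", "Diaper")]))
# [('Bread', 'Diaper', 'Milk')]
```

Requiring a shared prefix means each candidate is produced from exactly one pair, so no duplicate candidates are generated.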
Candidate 4-itemsets:
(A B C D) OK because (A B C), (A B D), (A C D), (B C D) are all frequent
(A C D E) Not OK because (C D E) is not frequent
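The subset check behind this example can be sketched as below. The set `frequent_3` is an assumption made for illustration, since the module does not list the frequent 3-itemsets; it is chosen so that it contains the four subsets of (A B C D) but not (C D E). The function name `has_infrequent_subset` is likewise hypothetical.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some subset one item shorter than `candidate` is not frequent."""
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent_prev
               for s in combinations(candidate, k))

# Assumed frequent 3-itemsets (illustrative only, not taken from the module).
frequent_3 = {frozenset(s) for s in ["ABC", "ABD", "ACD", "ACE", "BCD"]}

print(has_infrequent_subset("ABCD", frequent_3))  # False: candidate is kept
print(has_infrequent_subset("ACDE", frequent_3))  # True: candidate is pruned
print(len(list(combinations("ABCD", 3))))         # 4 subsets of length 3
```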
TABLE-1
TID Items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Solution:
Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
1. Count of each item: From the transactions, find the number of occurrences of each item.
TABLE-2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3, thus it is deleted.
TABLE-3
Item Count
I1 4
I2 5
I3 4
I4 4
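The counting and pruning of single items can be reproduced with a few lines of Python. This is a sketch under two assumptions: the transactions are T1 to T6 as listed in this example, and the threshold is rounded up with `math.ceil` when the product is not a whole number (the module only shows the case 0.5 * 6 = 3).

```python
import math
from collections import Counter

transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]

# Support threshold of 50% over 6 transactions gives min_sup = 3.
min_sup = math.ceil(0.5 * len(transactions))

# Count how many transactions each item appears in.
item_counts = Counter(item for t in transactions for item in t)

# Keep only the items that meet min_sup.
frequent_1 = {item: n for item, n in item_counts.items() if n >= min_sup}

print(sorted(item_counts.items()))
# [('I1', 4), ('I2', 5), ('I3', 4), ('I4', 4), ('I5', 2)]
print(sorted(frequent_1.items()))
# [('I1', 4), ('I2', 5), ('I3', 4), ('I4', 4)]
```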
3. Join Step: Form 2-itemsets. From TABLE-1, find the number of occurrences of each 2-itemset.
TABLE-4
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus they are deleted.
TABLE-5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
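The 2-itemset counts and the pruned result can be reproduced as follows. This is a sketch assuming transactions T1 to T6 and min_sup = 3 from this example; the variable names are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations

transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]
min_sup = 3
frequent_items = {"I1", "I2", "I3", "I4"}  # I5 was pruned in the previous step

# Join step: count every pair of frequent items that occurs in a transaction.
pair_counts = Counter()
for t in transactions:
    kept = sorted(item for item in t if item in frequent_items)
    pair_counts.update(combinations(kept, 2))

# Prune step: drop the pairs below min_sup.
frequent_2 = {pair: n for pair, n in pair_counts.items() if n >= min_sup}

for pair, n in sorted(frequent_2.items()):
    print(pair, n)
# ('I1', 'I2') 4
# ('I1', 'I3') 3
# ('I2', 'I3') 4
# ('I2', 'I4') 3
```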
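5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find the occurrences of each 3-itemset. From TABLE-5, find the 2-itemset subsets that meet min_sup. The candidate 3-itemsets are listed in TABLE-6.
For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, {I2, I3} all occur in TABLE-5, thus {I1, I2, I3} is kept. It occurs in 3 transactions (T1, T5, T6), so it is frequent.
For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4}, {I2, I4}. {I1, I4} is not frequent, as it does not occur in TABLE-5, thus {I1, I2, I4} is not frequent and is deleted.
For itemset {I1, I3, I4}, the subsets {I1, I4} and {I3, I4} do not occur in TABLE-5, thus it is deleted.
For itemset {I2, I3, I4}, the subset {I3, I4} does not occur in TABLE-5, thus it is deleted.
Only {I1, I2, I3} is frequent.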
TABLE-6
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4
6. Generate association rules: From the frequent itemset {I1, I2, I3}, the rules are:
{I1, I2} => {I3}: Confidence = support {I1, I2, I3} / support {I1, I2} = (3/4) * 100 = 75%
{I1, I3} => {I2}: Confidence = support {I1, I2, I3} / support {I1, I3} = (3/3) * 100 = 100%
{I2, I3} => {I1}: Confidence = support {I1, I2, I3} / support {I2, I3} = (3/4) * 100 = 75%
{I1} => {I2, I3}: Confidence = support {I1, I2, I3} / support {I1} = (3/4) * 100 = 75%
{I2} => {I1, I3}: Confidence = support {I1, I2, I3} / support {I2} = (3/5) * 100 = 60%
{I3} => {I1, I2}: Confidence = support {I1, I2, I3} / support {I3} = (3/4) * 100 = 75%
This shows that all the above association rules are strong if the minimum confidence threshold is 60%.
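The confidence calculations above can be checked with a short script. This is a sketch only: the function name `rules_from` is hypothetical, and the `support` dictionary simply restates the counts from the tables of this example.

```python
from itertools import combinations

# Support counts taken from the tables of this example.
support = {
    frozenset(["I1"]): 4, frozenset(["I2"]): 5, frozenset(["I3"]): 4,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I3"]): 3,
    frozenset(["I2", "I3"]): 4, frozenset(["I1", "I2", "I3"]): 3,
}

def rules_from(itemset, support, min_conf):
    """List (antecedent, consequent, confidence) for rules meeting min_conf."""
    itemset = frozenset(itemset)
    rules = []
    # Every nonempty proper subset of the itemset is a possible antecedent.
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = support[itemset] / support[lhs]
            if conf >= min_conf:
                rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return rules

rules = rules_from(["I1", "I2", "I3"], support, min_conf=0.6)
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, f"{conf:.0%}")
print(len(rules))  # 6
```

Raising `min_conf` to 0.7 would drop the rule with antecedent {I2}, whose confidence is exactly 60%.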
Disadvantages
- The algorithm can be very slow; the bottleneck is candidate generation.
- It assumes the transaction database is memory resident.
- It requires many database scans.
(Apply the Apriori algorithm to the data set below. Follow the steps of the algorithm, treat the data as a market basket analysis, and generate the frequent itemsets.)
TID Biscuit Bread Cheese Coffee Yogurt Cereal Chocolate Donuts Juice Milk Tea Eggs Newspaper Pastry Rolls Sugar Count
1 1 1 1 1 1 5
2 1 1 1 1 4
3 1 1 1 1 1 5
4 1 1 1 1 1 5
5 1 1 1 1 1 5
6 1 1 2
7 1 1 1 1 1 5
8 1 1 1 3
9 1 1 1 1 1 5
10 1 1 1 1 1 5
11 1 1 1 3
12 1 1 1 1 1 5
13 1 1 1 3
14 1 1 1 1 1 5
15 1 1 2
16 1 1
17 1 1 1 3
18 1 1 1 1 4
19 1 1 1 1 1 5
20 1 1 1 1 4
21 1 1 1 3
22 1 1 1 1 4
23 1 1 1 1 1 5
24 1 1 1 3
25 1 1 1 3
Count 4 13 12 9 2 9 9 10 11 6 4 2 2 1 2 1