Data Mining - Module 6


Republic of the Philippines

Province of Cotabato
Municipality of Makilala
MAKILALA INSTITUTE OF SCIENCE AND TECHNOLOGY
Makilala, Cotabato

COLLEGE OF TECHNOLOGY AND INFORMATION SYSTEMS


Bachelor of Science in Information Systems

Course Number      : PROF EL 3
Course Description : DATA MINING
Credit Units       : 3 units (3 hours lecture; 2 hours laboratory)
Module Number      : 6
Duration           : 2 weeks

Instructor         : RONALD L. BAJADOR
Mobile Number      : +639075182943
Email Address      : rbajadorsharp@gmail.com
I. LEARNING OUTCOMES

Upon completion of this material, you should be able to:


• discuss the overview of the Apriori Algorithm and its key concepts
• determine the steps to perform the Apriori Algorithm
• consider market basket analysis
• generate frequent itemsets in a given data set
• apply the Apriori Algorithm to a given data set
• discuss the advantages and disadvantages of the Apriori Algorithm

II. TOPIC(S) - Apriori Algorithm


Lesson 1: Apriori Background and its key concepts
Lesson 2: Steps to perform Apriori Algorithm
Lesson 3: Market Basket Analysis
Lesson 4: Frequent Itemset Generation
Lesson 5: Application of Apriori Algorithm
Lesson 6: Advantages and Disadvantages of Apriori Algorithm

III. REFERENCES

• Main Textbook

- Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to Data Mining, 2nd Edition.

- Han, J., Kamber, M., & Pei, J. (2013). Data Mining: Concepts and Techniques, 3rd Edition.

- Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition.

IV. COURSE CONTENT

Lesson 1: Apriori Background and Its Key Concepts

• Proposed by R. Agrawal, T. Imielinski, and A. Swami

- "Mining Association Rules between Sets of Items in Large Databases"
- SIGMOD, June 1993
• The Apriori Algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
• Apriori uses a "bottom-up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data.
• Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).



• Frequent Itemsets: the sets of items that meet the minimum support (denoted by Li for the set of frequent i-itemsets)
• Apriori Property: any subset of a frequent itemset must itself be frequent.
• Join Operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

• The Apriori principle holds due to the following property of the support measure:
∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
- The support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.
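
To see the anti-monotone property in action, here is a minimal Python sketch (the transactions and names below are illustrative, not from the module) that computes support and confirms that a superset never has higher support than its subset:

# Illustrative transactions (not from the module).
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X = {"bread"}
Y = {"bread", "milk"}                      # X is a subset of Y
assert support(X, transactions) >= support(Y, transactions)
print(support(X, transactions), support(Y, transactions))   # 0.8 0.6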

Lesson 2: Steps to Perform Apriori Algorithm

STEP 1: Scan the transaction database to get the support S of each 1-itemset, compare S with min_sup, and get the set of frequent 1-itemsets, L1.
STEP 2: Join Lk-1 with itself to generate the set of candidate k-itemsets.
STEP 3: Scan the transaction database to get the support S of each candidate k-itemset, compare S with min_sup, and get the set of frequent k-itemsets, Lk.
STEP 4: If the candidate set is not null, repeat from STEP 2; otherwise, continue to STEP 5.
STEP 5: For each frequent itemset l, generate all nonempty subsets of l.
STEP 6: For every nonempty subset s of l, output the rule "s => (l - s)" if the confidence C of the rule (= support S of l / support S of s) is at least min_conf.
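
Expressed in code, steps 1-4 look roughly like the following Python sketch (function and variable names are my own; this is one way to realize the flow above, not the module's official implementation):

from itertools import combinations

def apriori(transactions, min_sup):
    # transactions: list of sets of items; min_sup: absolute support count.
    # Returns {frozenset: support count} for every frequent itemset.
    items = {i for t in transactions for i in t}
    # STEP 1: count each 1-itemset and keep those meeting min_sup (L1).
    Lk = {}
    for i in items:
        count = sum(1 for t in transactions if i in t)
        if count >= min_sup:
            Lk[frozenset([i])] = count
    frequent = dict(Lk)
    k = 2
    while Lk:
        # STEP 2: join L(k-1) with itself to form candidate k-itemsets,
        # pruning any candidate with an infrequent (k-1)-subset.
        prev = list(Lk)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # STEP 3: scan the database to count support; keep the frequent ones (Lk).
        Lk = {}
        for c in candidates:
            count = sum(1 for t in transactions if c <= t)
            if count >= min_sup:
                Lk[c] = count
        frequent.update(Lk)
        k += 1   # STEP 4: repeat until no candidates survive
    return frequent

Running this on the six transactions of Example 1 (Lesson 5) with min_sup = 3 reproduces TABLE-3, TABLE-5, and the final frequent itemset {I1, I2, I3}.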
Lesson 3: Market Basket Analysis
• Provides insight into which products tend to be purchased together and which are most amenable to promotion.
• Actionable rules
• Trivial rules
- e.g., people who buy chalk also buy a duster
• Inexplicable rules
- e.g., people who buy a mobile phone also buy a bag

Lesson 4: Frequent Itemset Generation

1. Let k = 1
2. Generate frequent itemsets of length 1
3. Repeat until no new frequent
itemsets are identified
1. Generate length (k+1) candidate
itemsets from length k frequent Itemsets
2. Prune candidate itemsets containing subsets
of length k that are infrequent
- How many k-itemsets contained
in a (k+1)-itemset?
3. Count the support of each candidate
by scanning the DB
4. Eliminate candidates that are infrequent,
5. leaving only those that are frequent

Note: steps 3.2 and 3.4 prune itemsets that are infrequent
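
The subset test in step 3.2 can be written directly; a minimal Python sketch (the helper name has_infrequent_subset is my own), using the 2-itemsets from Lesson 5 as a check:

from itertools import combinations

def has_infrequent_subset(candidate, frequent_k):
    # candidate: frozenset of k+1 items; frequent_k: set of frozen k-itemsets.
    # A (k+1)-itemset has exactly k+1 subsets of size k; all must be frequent.
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent_k
               for s in combinations(candidate, k))

# From Lesson 5: {I1, I4} is infrequent, so {I1, I2, I4} gets pruned.
L2 = {frozenset(s) for s in [("I1","I2"), ("I1","I3"), ("I2","I3"), ("I2","I4")]}
print(has_infrequent_subset(frozenset(["I1","I2","I4"]), L2))  # True  -> prune
print(has_infrequent_subset(frozenset(["I1","I2","I3"]), L2))  # False -> keep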



• Generating Itemsets Efficiently
• How can we efficiently generate all (frequent) itemsets at each iteration?
o Avoid generating repeated itemsets and infrequent itemsets
• Finding one-item sets is easy.
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
o If (A B) is a frequent itemset, then (A) and (B) have to be frequent itemsets as well!
o In general: if X is a frequent k-itemset, then all (k-1)-item subsets of X are also frequent.
o ⇒ Compute each k-itemset by merging two (k-1)-itemsets. Which ones? Two that agree on their first k-2 items, as the example below shows.

E.g., merge {Bread, Milk} with {Bread, Diaper} to get {Bread, Diaper, Milk}.

• Example: generating frequent itemsets

• Given: five frequent 3-itemsets (A B C), (A B D), (A C D), (A C E), (B C D)

1. Keep the itemsets lexicographically ordered!
2. Merge (x_1, x_2, ..., x_{k-1}) with (y_1, y_2, ..., y_{k-1}) only if x_1 = y_1, x_2 = y_2, ..., x_{k-2} = y_{k-2}.

• Candidate 4-itemsets:
(A B C D) OK, because (A B C), (A B D), (A C D), (B C D) are all frequent
(A C D E) not OK, because (C D E) is not frequent

3. Final check by counting instances in the dataset!
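
A small Python sketch of this merge rule (the function name merge_step is my own), reproducing the example above:

def merge_step(frequent_k_minus_1):
    # Join sorted (k-1)-itemsets that agree on their first k-2 items.
    itemsets = sorted(tuple(sorted(s)) for s in frequent_k_minus_1)
    candidates = []
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            x, y = itemsets[i], itemsets[j]
            if x[:-1] == y[:-1]:            # first k-2 items match
                candidates.append(x + (y[-1],))
    return candidates

L3 = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
print(merge_step(L3))   # [('A','B','C','D'), ('A','C','D','E')]

The merge alone produces both (A B C D) and (A C D E); the subsequent subset check is what rejects (A C D E), since (C D E) is not frequent.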

Lesson 5: Apriori Algorithm Example

Example 1: Support threshold = 50%, Confidence threshold = 60%


TABLE-1
Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

Solution:
Support threshold = 50% => 0.5 × 6 = 3 => min_sup = 3

1. Count of Each Item

TABLE-2
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
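
As a quick check, the counts in TABLE-2 can be reproduced with a few lines of Python (a sketch, not part of the module):

from collections import Counter

transactions = [
    {"I1","I2","I3"}, {"I2","I3","I4"}, {"I4","I5"},
    {"I1","I2","I4"}, {"I1","I2","I3","I5"}, {"I1","I2","I3","I4"},
]
counts = Counter(item for t in transactions for item in t)
print(sorted(counts.items()))
# [('I1', 4), ('I2', 5), ('I3', 4), ('I4', 4), ('I5', 2)]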



2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3, so it is deleted; only I1, I2, I3, and I4 meet the min_sup count.

TABLE-3
Item   Count
I1     4
I2     5
I3     4
I4     4

3. Join Step: Form the 2-itemsets. From TABLE-1, find the occurrences of each 2-itemset.

TABLE-4
Itemset   Count
I1,I2     4
I1,I3     3
I1,I4     2
I2,I3     4
I2,I4     3
I3,I4     2

4. Prune Step: TABLE-4 shows that the itemsets {I1, I4} and {I3, I4} do not meet min_sup, so they are deleted.

TABLE-5
Itemset   Count
I1,I2     4
I1,I3     3
I2,I3     4
I2,I4     3

5. Join and Prune Step: Form the 3-itemsets. From TABLE-1, find the occurrences of each 3-itemset; from TABLE-5, check that every 2-itemset subset meets min_sup.

For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all occur in TABLE-5, so {I1, I2, I3} is frequent.

For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4}, and {I2, I4}; {I1, I4} is not frequent, as it does not occur in TABLE-5, so {I1, I2, I4} is not frequent and is deleted.

TABLE-6
Candidate 3-itemsets
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4

Only {I1, I2, I3} is frequent.



6. Generate Association Rules: From the frequent itemset discovered above, the candidate association rules are:

{I1, I2} => {I3}

Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) × 100 = 75%

{I1, I3} => {I2}

Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) × 100 = 100%

{I2, I3} => {I1}

Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) × 100 = 75%

{I1} => {I2, I3}

Confidence = support{I1, I2, I3} / support{I1} = (3/4) × 100 = 75%

{I2} => {I1, I3}

Confidence = support{I1, I2, I3} / support{I2} = (3/5) × 100 = 60%

{I3} => {I1, I2}

Confidence = support{I1, I2, I3} / support{I3} = (3/4) × 100 = 75%

This shows that all of the above association rules are strong when the minimum confidence threshold is 60%.
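
The same confidences can be recomputed programmatically; a short Python sketch (names are my own) over the TABLE-1 transactions:

from itertools import combinations

transactions = [
    {"I1","I2","I3"}, {"I2","I3","I4"}, {"I4","I5"},
    {"I1","I2","I4"}, {"I1","I2","I3","I5"}, {"I1","I2","I3","I4"},
]

def support(itemset):
    # Absolute support count of `itemset` in the transaction list.
    return sum(1 for t in transactions if itemset <= t)

L = frozenset({"I1", "I2", "I3"})
for r in range(1, len(L)):
    for antecedent in combinations(sorted(L), r):
        a = frozenset(antecedent)
        confidence = support(L) / support(a) * 100
        print(sorted(a), "=>", sorted(L - a), f"{confidence:.0f}%")

This prints the six rules with confidences 75%, 60%, 75%, 75%, 100%, and 75%, matching the hand calculation above.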

Example 2:



Example 3:

Example 4:

Lesson 6: Advantages and Disadvantages of Apriori Algorithm


• Advantages

- Uses the large itemset property
- Easily parallelized
- Easy to implement

• Disadvantages
- The algorithm can be very slow; the bottleneck is candidate generation.
- Assumes the transaction database is memory resident
- Requires many database scans



V. ACTIVITY/ EXERCISES/EVALUATION

(Apply the Apriori Algorithm to the data sets below. Follow the steps to perform the Apriori Algorithm, consider the market basket analysis, and generate the frequent itemsets.)

1. Rule: item/items frequently purchased in at least 50% of transactions

Transaction ID   Items purchased

T1               {Apple, Mango, Pears}
T2               {Mango, Pears, Cabbage, Carrots}
T3               {Pears, Carrots, Mango}
T4               {Carrots, Mango}

2. Rule: item/items frequently purchased in at least 25% of transactions

TID × Item incidence table: 25 transactions over 16 items (Biscuit, Bread, Cheese, Coffee, Yogurt, Cereal, Chocolate, Donuts, Juice, Milk, Tea, Eggs, Newspaper, Pastry, Rolls, Sugar). [The 0/1 cell entries did not survive reproduction; only the marginal totals below are recoverable.]

Items per transaction (TID: count): 1: 5, 2: 4, 3: 5, 4: 5, 5: 5, 6: 2, 7: 5, 8: 3, 9: 5, 10: 5, 11: 3, 12: 5, 13: 3, 14: 5, 15: 2, 16: 1, 17: 3, 18: 4, 19: 5, 20: 4, 21: 3, 22: 4, 23: 5, 24: 3, 25: 3

Transactions per item: Biscuit 4, Bread 13, Cheese 12, Coffee 9, Yogurt 2, Cereal 9, Chocolate 9, Donuts 10, Juice 11, Milk 6, Tea 4, Eggs 2, Newspaper 2, Pastry 1, Rolls 2, Sugar 1

