
P. NAGARJUN GUPTA          RAJENDRA
04L31A0521                 04L31A0557
VIGNAN IIT                 VIGNAN IIT
VISAKHAPATNAM              VISAKHAPATNAM

CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

ABSTRACT

Association mining may often derive an undesirably large set of frequent itemsets and association rules. Recent studies have proposed an interesting alternative: mining frequent closed itemsets and their corresponding rules, which has the same power as association mining but substantially reduces the number of rules to be presented.
This paper presents CLOSET, an efficient algorithm for mining frequent closed itemsets, built on three techniques:
(1) applying a compressed frequent-pattern tree (FP-tree) structure for mining closed itemsets without candidate generation,
(2) developing a single prefix path compression technique to identify frequent closed itemsets quickly, and
(3) exploring a partition-based projection mechanism for scalable mining in large databases.
CLOSET is efficient and scalable over large databases and is faster than previously proposed methods.

INTRODUCTION:

It has been well recognized that frequent pattern mining plays an essential role in many important data mining tasks, e.g. associations, sequential patterns, episodes, partial periodicity, etc. However, it is also well known that frequent pattern mining often generates a very large number of frequent itemsets and rules, which reduces not only the efficiency but also the effectiveness of mining, since users have to sift through a large number of mined rules to find useful ones.
There is an interesting alternative: instead of mining the complete set of frequent itemsets and their associations, mine only the frequent closed itemsets and their corresponding rules. An important implication is that mining frequent closed itemsets has the same power as mining the complete set of frequent itemsets, but it substantially reduces the redundant rules to be generated and increases both the efficiency and the effectiveness of mining.

Definition: Association rule mining searches for interesting relationships among items in a given data set.

Interestingness Measures:

Certainty: Each discovered pattern should have a measure of certainty associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty measure for association rules of the form "A⇒B", where A and B are sets of items, is confidence. Given a set of task-relevant data tuples, the confidence of "A⇒B" is defined as

Confidence(A⇒B) = (# tuples containing both A and B) / (# tuples containing A)

Utility: The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples for which the pattern is true. For association rules of the form "A⇒B", where A and B are sets of items, it is defined as

Support(A⇒B) = (# tuples containing both A and B) / (total # of tuples)
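To make the two measures concrete, the following minimal Python sketch (illustrative code, not from the paper; function names are ours) computes support and confidence over an in-memory list of transactions:

# Support and confidence over a list of transactions (each a set of items).
def support(transactions, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, a, b):
    """Confidence(A=>B): (# tuples containing both A and B) / (# tuples containing A)."""
    a, b = set(a), set(b)
    containing_a = [t for t in transactions if a <= t]
    if not containing_a:
        return 0.0
    return sum(1 for t in containing_a if b <= t) / len(containing_a)

tdb = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}]
print(support(tdb, {"I2"}))             # 1.0: every transaction contains I2
print(confidence(tdb, {"I1"}, {"I5"}))  # 1.0: the only transaction with I1 also has I5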

Association rules that satisfy both a user-specified minimum confidence and a user-specified minimum support threshold are referred to as strong association rules.

Association rule mining is a two-step process:
   1. Find all frequent itemsets: each of these itemsets will occur at least as frequently as a pre-determined minimum support count.
   2. Generate strong association rules from the frequent itemsets: these rules must satisfy minimum support and minimum confidence.

The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation:

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. The finding of each Lk requires one full scan of the database.

Finding the frequent itemsets is itself a two-step process:

The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 are joined if

(l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-2]=l2[k-2]) ∧ (l1[k-1]<l2[k-1]).

The condition l1[k-1]<l2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is l1[1] l1[2] … l1[k-1] l2[k-1].

The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset either, and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
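The join and prune steps translate into a few lines of Python (our own illustrative sketch, not the paper's code; itemsets are kept as lexicographically sorted tuples, so two members of Lk-1 are joinable exactly when their first k-2 items agree):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the frequent (k-1)-itemsets L_prev.
    Itemsets are represented as lexicographically sorted tuples."""
    L_prev = sorted(L_prev)
    prev_set = set(L_prev)
    candidates = []
    for i, l1 in enumerate(L_prev):
        for l2 in L_prev[i + 1:]:
            # Join step: first k-2 items equal; l1[k-1] < l2[k-1] (1-based)
            # ensures each candidate is generated only once.
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                c = l1 + (l2[k - 2],)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

L2 = [("I1","I2"), ("I1","I3"), ("I1","I5"), ("I2","I3"), ("I2","I4"), ("I2","I5")]
print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]

Note how the prune test applies the Apriori property directly: a joined candidate survives only if all of its (k-1)-subsets are already known to be frequent.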
Example:-

TID    List of Item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

   1) In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
   2) Suppose that the minimum support count required is 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support.
   3) To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
In this way we find candidate sets until a candidate set is null (a sketch of the full level-wise loop follows below).

Generating Association Rules from Frequent Itemsets

Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them. This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support count:

Confidence(A⇒B) = support_count(A∪B) / support_count(A)

where support_count(A∪B) is the number of transactions containing the itemset A∪B, and support_count(A) is the number of transactions containing the itemset A. Based on this equation, association rules can be generated as follows:

• for each frequent itemset l, generate all nonempty subsets of l.
• for every nonempty subset s of l, output the rule "s⇒(l-s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.

Eg:- Suppose the data contain the frequent itemset l = {I1, I2, I5}. What are the association rules that can be generated from l?
The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting association rules are as shown below, each listed with its confidence.

I1∧I2⇒I5, confidence = 2/4 = 50%
I1∧I5⇒I2, confidence = 2/2 = 100%
I2∧I5⇒I1, confidence = 2/2 = 100%
I1⇒I2∧I5, confidence = 2/6 = 33%
I2⇒I1∧I5, confidence = 2/7 = 29%
I5⇒I1∧I2, confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only the second, third and last rules above are output, since these are the only ones generated that are strong.
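This procedure is a few lines of Python (an illustrative sketch; as the text suggests, the support counts are assumed precomputed and stored in a hash table):

from itertools import combinations

def gen_rules(l, support_count, min_conf):
    """Print s => (l - s) for every nonempty proper subset s of the frequent
    itemset l whose confidence support_count[l]/support_count[s] >= min_conf."""
    for r in range(1, len(l)):
        for s in combinations(l, r):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                print(f"{set(s)} => {set(l) - set(s)}, confidence = {conf:.0%}")

# Support counts taken from the nine-transaction example database.
support_count = {("I1",): 6, ("I2",): 7, ("I5",): 2,
                 ("I1","I2"): 4, ("I1","I5"): 2, ("I2","I5"): 2,
                 ("I1","I2","I5"): 2}
gen_rules(("I1","I2","I5"), support_count, min_conf=0.7)
# Prints the three strong rules: I5 => I1,I2; I1,I5 => I2; I2,I5 => I1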
Mining Frequent Itemsets without Candidate Generation

The Apriori algorithm suffers from two non-trivial costs:
   1) It may need to generate a huge number of candidate sets.
   2) It may need to repeatedly scan the database and check a large set of candidates by pattern matching.
An interesting method that mines the complete set of frequent itemsets without candidate generation is called frequent-pattern growth, or simply FP-growth, which adopts a divide-and-conquer strategy as follows: compress the database representing frequent items into a frequent-pattern tree (FP-tree), then divide this compressed database into a set of conditional databases, each associated with one frequent item, and mine each such database separately.

Reexamine the mining of the transaction database above, using the frequent-pattern growth approach.

The first scan of the database is the same as in Apriori, which derives the set of frequent items (1-itemsets) and their support counts. Let the minimum support count be 2. The set of frequent items is sorted in the order of descending support count. This resulting set or list is denoted L. Thus, we have L = [I2:7, I1:6, I3:6, I4:2, I5:2].

An FP-tree is then constructed as follows. First, create the root of the tree, labeled with "null". Scan database D a second time. The items in each transaction are processed in L order, and a branch is created for each transaction. For example, the scan of the first transaction, "T100: I1, I2, I5", which contains three items (I2, I1, I5) in L order, leads to the construction of the first branch of the tree with three nodes, <(I2:1), (I1:1), (I5:1)>, where I2 is linked as a child of the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is linked to I2. However, this branch would share a common prefix, (I2), with the existing path for T100. Therefore, we instead increment the count of the I2 node, giving (I2:2). In general, when considering the branch to be added for a transaction, the count of each node along a common prefix is incremented by 1, and nodes for the items following the prefix are created and linked accordingly.

To facilitate tree traversal, an item header table is built so that each item points to its occurrences in the tree via a chain of node-links.

[Figure: the FP-tree obtained after scanning all of the transactions.]
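The construction just described can be sketched in Python as follows (illustrative code; the class and field names are ours, not the paper's):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}   # item -> FPNode

def build_fptree(transactions, min_count):
    counts = {}
    for t in transactions:               # first scan: global support counts
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {i: c for i, c in counts.items() if c >= min_count}   # frequent items
    root, header = FPNode(None, None), {}
    for t in transactions:               # second scan: insert items in L order
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            if item in node.children:
                node.children[item].count += 1      # shared prefix: bump the count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)  # node-link chain
            node = node.children[item]
    return root, header

tdb = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
       {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header = build_fptree(tdb, 2)
print(len(header["I5"]))   # 2: I5 sits on two branches of the tree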
occurs in two branches of the FP-tree. The
The mining of the FP-tree proceeds as follows. Start from each frequent length-1 pattern, construct its conditional pattern base (a sub-database which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree, and perform mining recursively on such a tree. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree.

Let's first consider I5, which is the last item in L, rather than the first. The reasoning behind this will become apparent as we explain the FP-tree mining process. I5 occurs in two branches of the FP-tree. The paths formed by these branches are <(I2, I1, I5:1)> and <(I2, I1, I3, I5:1)>. Therefore, considering I5 as a suffix, its corresponding two prefix paths are <(I2, I1:1)> and <(I2, I1, I3:1)>, which form its conditional pattern base. Its conditional FP-tree contains only a single path, <I2:2, I1:2>; I3 is not included because its support count of 1 is less than the minimum support count. The single path generates all the combinations of frequent patterns: I2 I5:2, I1 I5:2, I2 I1 I5:2.

Item   Conditional pattern base     Conditional FP-tree   Frequent patterns generated
I5     {(I2 I1:1), (I2 I1 I3:1)}    <I2:2, I1:2>          I2 I5:2, I1 I5:2, I2 I1 I5:2
I4     {(I2 I1:1), (I2:1)}          <I2:2>                I2 I4:2

In the same way we find the frequent itemsets for all other items. The FP-growth method transforms the problem of finding long frequent patterns into looking for shorter ones recursively and then concatenating the suffix. It uses the least frequent items as suffixes, offering good selectivity. The method substantially reduces the search costs.
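Reading a conditional pattern base off the header table is then a walk up the parent pointers (illustrative code, assuming FPNode and the root/header built by build_fptree in the previous sketch):

def conditional_pattern_base(header, item):
    """Collect, for every node holding `item`, its prefix path up to the root
    (excluding the root and the item itself) together with that node's count."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base(header, "I5"))
# [(['I2', 'I1'], 1), (['I2', 'I1', 'I3'], 1)] -- the two prefix paths of I5

From this base one builds the conditional FP-tree (here the single path <I2:2, I1:2>, since I3's count of 1 falls below the minimum support count) and recurses.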

CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets

PROBLEM DEFINITION:

An itemset X is contained in a transaction <tid, Y> if X ⊆ Y. Given a transaction database TDB, the support of an itemset X, denoted as sup(X), is the number of transactions in TDB which contain X. An association rule R: X⇒Y is an implication between two itemsets X and Y, where X, Y ⊂ I and X∩Y = ∅. The support of the rule, denoted as sup(X⇒Y), is defined as sup(X∪Y). The confidence of the rule, denoted as conf(X⇒Y), is defined as sup(X∪Y)/sup(X).
The requirement of mining the complete set of association rules leads to two problems:
   1) There may exist a large number of frequent itemsets in a transaction database, especially when the support threshold is low.
   2) There may exist a huge number of association rules. It is hard for users to comprehend and manipulate a huge number of rules.
An interesting alternative to this problem is the mining of frequent closed itemsets and their corresponding association rules.

Frequent closed itemset: An itemset X is a closed itemset if there exists no itemset X' such that (1) X' is a proper superset of X, and (2) every transaction containing X also contains X'. A closed itemset X is frequent if its support passes the given support threshold.

The problem is how to find the complete set of frequent closed itemsets efficiently from a large database; this is called the frequent closed itemset mining problem.
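On toy data the definition can be checked by brute force (our own sketch, not the CLOSET algorithm): X is closed exactly when no single item outside X appears in every transaction containing X. The transactions used here are those of Table 1, which follows below.

from itertools import combinations

def is_closed(transactions, itemset):
    """X is closed iff no item outside X occurs in every transaction containing X."""
    x = set(itemset)
    covers = [t for t in transactions if x <= t]
    if not covers:
        return False
    others = set.union(*covers) - x
    return not any(all(i in t for t in covers) for i in others)

def closed_frequent_itemsets(transactions, min_sup):
    items = sorted(set.union(*transactions))
    result = {}
    for r in range(1, len(items) + 1):      # exponential: fine only for toy data
        for cand in combinations(items, r):
            sup = sum(1 for t in transactions if set(cand) <= t)
            if sup >= min_sup and is_closed(transactions, cand):
                result["".join(cand)] = sup
    return result

tdb = [{"a","c","d","e","f"}, {"a","b","e"}, {"c","e","f"},
       {"a","c","d","f"}, {"c","e","f"}]
print(closed_frequent_itemsets(tdb, 2))
# {'a': 3, 'e': 4, 'ae': 2, 'cf': 4, 'cef': 3, 'acdf': 2}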
For the transaction database TDB in Table 1 with min_sup = 2, the divide-and-conquer method for mining frequent closed itemsets proceeds as follows.

Table 1: The transaction database TDB
TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

   1) Find frequent items. Scan TDB to find the set of frequent items and derive a global frequent item list, called f_list: f_list = {c:4, e:4, f:4, a:3, d:2}, where the items are sorted in support-descending order; any infrequent item, such as b, is omitted.
   2) Divide the search space. All the frequent closed itemsets can be divided into 5 non-overlapping subsets based on the f_list: (1) the ones containing item d, (2) the ones containing item a but no d, (3) the ones containing item f but no a nor d, (4) the ones containing e but no f, a nor d, and (5) the one containing only c. Once all subsets are found, the complete set of frequent closed itemsets is obtained.
   3) Find the subsets of frequent closed itemsets. The subsets of frequent closed itemsets can be mined by constructing the corresponding conditional databases and mining each recursively (see the sketch after this list).
      Find the frequent closed itemsets containing d. Only the transactions containing d are needed. The d-conditional database, denoted as TDB|d, contains all the transactions having d, which is {cefa, cfa}. Notice that item d is omitted in each transaction, since it appears in every transaction in the d-conditional database.
      The support of d is 2. Items c, f and a each appear twice in TDB|d. Therefore, cfad:2 is a frequent closed itemset. Since this itemset covers every frequent item in TDB|d, the search for frequent closed itemsets containing d finishes.
      In the same way, we find the frequent closed itemsets for a, f, e, and c.
   4) The set of frequent closed itemsets found is {acdf:2, a:3, ae:2, cf:4, cef:3, e:4}.
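A small sketch of the projection step used above (illustrative helpers only; CLOSET itself performs this on FP-trees, per Optimizations 1 and 2 below): extract the d-conditional database, then the items that appear in all of its transactions.

def conditional_db(transactions, item):
    """Project TDB on `item`: keep only transactions containing it, drop the item."""
    return [set(t) - {item} for t in transactions if item in t]

def items_in_every_transaction(db):
    """Items appearing in every transaction of a (conditional) database."""
    return set.intersection(*db) if db else set()

tdb = [{"a","c","d","e","f"}, {"a","b","e"}, {"c","e","f"},
       {"a","c","d","f"}, {"c","e","f"}]
tdb_d = conditional_db(tdb, "d")          # TDB|d = {cefa, cfa}
print(items_in_every_transaction(tdb_d))  # {'a', 'c', 'f'}: with d, gives cfad:2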
Optimization 1: Compress transactional and conditional databases using FP-tree structures. The FP-tree compresses databases for frequent itemset mining, and conditional databases can be derived from an FP-tree efficiently.

Optimization 2: Extract items appearing in every transaction of a conditional database.

Optimization 3: Directly extract frequent closed itemsets from the FP-tree.

Optimization 4: Prune search branches.

PERFORMANCE STUDY

In a comparison of A-close, CHARM, and CLOSET, CLOSET outperforms both CHARM and A-close. CLOSET is efficient and scalable in mining frequent closed itemsets in large databases. It is much faster than A-close, and also faster than CHARM.

CONCLUSION

CLOSET leads to fewer and more interesting association rules than the other previously proposed methods.
