
DATA MINING AND ASSOCIATION RULES

By

R. CHANDRA SEKHAR (02121A0514)
V. PENCHALA PRAVEEN (02121A0563)

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SREE VIDYANIKETHAN ENGINEERING COLLEGE
SREE SAINATH NAGAR, A. RANGAMPET (A.P.) 517102

INDEX

1. Abstract
2. Introduction
   2.1 What is Data Mining?
   2.2 Knowledge Discovery in Databases
   2.3 Stages in Knowledge Discovery in Databases
3. Data Mining
   3.1 Techniques
   3.2 Applications
4. Clustering
   4.1 Components of a Clustering Task
   4.2 Stages in Clustering
5. Clustering Techniques
6. Partitional Algorithms
7. k-Medoid Algorithms
   7.1 PAM (Partitioning Around Medoids)
       7.1.1 Partitioning
       7.1.2 Iterative Selection of Medoids
       7.1.3 PAM Algorithm
   7.2 CLARA
       7.2.1 CLARA Algorithm
   7.3 CLARANS
       7.3.1 CLARANS Algorithm
8. Conclusion
9. References

ABSTRACT

With the explosive growth of data, the extraction of useful information
from it has become a major task. Data mining is the non-trivial process of
identifying valid, novel, potentially useful and ultimately understandable
patterns in data. Among the areas of data mining, the problem of clustering
data objects has received a great deal of attention.

Clustering segments a database into subsets or clusters. It is a useful
technique for discovering the distribution and patterns of the underlying
data. The goal of clustering is to discover dense and sparse regions in a
data set.

We focus mainly on finding cluster groups in a database, for which we
present three algorithms:

PAM (Partitioning Around Medoids)
CLARA (Clustering LARge Applications)
CLARANS (Clustering Large Applications based on RANdomized Search)

INTRODUCTION

What is Data Mining?

The past two decades have seen a dramatic increase in the amount of
information, or data, being stored in electronic format. This accumulation
of data has taken place at an explosive rate. It has been estimated that
the amount of information in the world doubles every 20 months, and the
size and number of databases are increasing even faster. The growing use of
electronic data-gathering devices, such as point-of-sale and remote-sensing
devices, has contributed to this explosion of available data. A figure from
the Red Brick Company illustrates this data explosion.

Data storage became easier as large amounts of computing power became
available at low cost; that is, the falling cost of processing power and
storage made data cheap. There was also the introduction of new machine
learning methods for knowledge representation, based on logic programming
and the like, in addition to traditional statistical analysis of data. The
term data mining has been stretched beyond its limits to apply to any form
of data analysis. Some of the numerous definitions of data mining, or
Knowledge Discovery in Databases, are:
Data mining, or Knowledge Discovery in Databases (KDD), is the non-trivial
extraction of implicit, previously unknown, and potentially useful
information from data. It encompasses a number of different technical
approaches, such as clustering, data summarization, learning classification
rules, finding dependency networks, analyzing changes, and detecting
anomalies.

Data mining is the search for relationships and global patterns that exist
in large databases but are 'hidden' among the vast amount of data, such as
a relationship between patient data and medical diagnoses.

The analogy with the mining process is described as: data mining refers to
"using a variety of techniques to identify nuggets of information or
decision-making knowledge in bodies of data, and extracting these in such a
way that they can be put to use in areas such as decision support,
prediction, forecasting and estimation." The data is often voluminous but
of low value as it stands, since no direct use can be made of it; it is the
hidden information in the data that is useful. Recorded events are data;
giving structure to the data yields information. Day-to-day information
grows vastly, so extracting useful information is difficult. We therefore
need a set of tools to analyze the data, and this set of tools is called
data mining.
Knowledge Discovery in Databases:

The term knowledge means relationships and patterns between data elements.
The knowledge discovery process consists of six stages, of which data
mining is the discovery stage.

The stages in KDD are:
1. Data Selection
2. Cleaning
3. Enrichment
4. Coding
5. Data Mining
6. Reporting

The fifth stage, data mining, is the phase of real discovery. Data mining
methodology states that, in the optimal situation, data mining is an
ongoing process: one should continually work on the data, constantly
identify new information needs, and try to improve the data to match the
goals better, so that the organization becomes a learning system. Since
most of the phases need a great deal of creativity, such a process enables
and encourages this creativity by refusing to impose any limit on possible
activities.

[Figure: THE PROCESS OF KNOWLEDGE DISCOVERY — a feedback loop from
requirements through data selection, data cleaning, enrichment and coding,
data mining, and reporting, with deployment and interpretation of the
results feeding back into the process.]

DATA MINING TECHNIQUES:

Data mining is not a single technique; the underlying idea is that there is
more knowledge hidden in the data than shows itself on the surface, so any
technique that helps extract more out of the data is useful. Data mining
techniques therefore form quite separate, heterogeneous groups.

The different data mining techniques that are useful are:

Association Rules
Classification Rules
Query Tools
Cluster Computing
Statistical Techniques
Decision Trees
Visualization
On-Line Analytical Processing (OLAP)
k-Nearest Neighbours
Neural Networks
Genetic Algorithms

DATA MINING APPLICATIONS

MARKETING MANAGEMENT
Target Marketing
Cross Selling
Customer Retention
Market Basket Analysis
Market Segmentation

RISK MANAGEMENT
Forecasting
Customer Retention
Improved Underwriting
Quality Control
Competitor Analysis

FRAUD MANAGEMENT
Fraud Detection

The above listing relates the techniques of data mining to their
applications in a business environment.

CLUSTERING

Clustering is the division of data into groups of similar objects. Each
group, called a cluster, consists of objects that are similar to one
another and dissimilar to objects of other groups.

An example of clustering is depicted in the figures below. The input
patterns are shown in Figure (a), and the desired clusters are shown in
Figure (b); points belonging to the same cluster are in the same group. The
variety of techniques for representing data, measuring proximity
(similarity) between data elements, and grouping data elements has produced
a rich and often confusing assortment of clustering methods.

Components of a Clustering Task

Typical pattern clustering activity involves the following steps:
(1) pattern representation (optionally including feature extraction and/or
selection),
(2) definition of a pattern proximity measure appropriate to the data
domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps,
including a feedback path where the output of the grouping process can
affect subsequent feature extraction and similarity computations.

Pattern representation refers to the number of classes, the number of
available patterns, and the number, type, and scale of the features
available to the clustering algorithm. Some of this information may not be
controllable by the practitioner.
Stages in Clustering:

1. Feature selection is the process of identifying the most effective
subset of the original features to use in clustering. Feature extraction is
the use of one or more transformations of the input features to produce new
salient features. Either or both of these techniques can be used to obtain
an appropriate set of features for clustering.

2. Pattern proximity is usually measured by a distance function defined on
pairs of patterns. A variety of distance measures are in use in the various
communities. A simple distance measure can often be used to reflect
dissimilarity between two patterns, whereas other similarity measures can
be used to characterize the conceptual similarity between patterns.

3. The grouping step can be performed in a number of ways. The output
clustering (or clusterings) can be hard (a partition of the data into
groups) or fuzzy (where each pattern has a variable degree of membership in
each of the output clusters).
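As a concrete illustration of the proximity stage, a pattern proximity measure can be as simple as the Euclidean distance between feature vectors. The sketch below (a minimal example; the three-point pattern set is made up) builds the pairwise dissimilarity matrix that medoid-based algorithms such as PAM assume as input.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy pattern set: two nearby points and one distant point.
patterns = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]

# Pairwise dissimilarity matrix, computed once up front.
dissim = [[euclidean(p, q) for q in patterns] for p in patterns]
```

Once such a matrix is available, the raw feature vectors are no longer needed: the grouping step can work on dissimilarities alone.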

CLUSTERING TECHNIQUES:

There are two main types of clustering techniques:

1. Partitional clustering:
Partitional clustering techniques construct a partition of the database
into a predefined number of clusters. They attempt to determine k
partitions that optimize a certain criterion function. The partitional
clustering algorithms are of two types:
k-means algorithms
k-medoid algorithms

2. Hierarchical clustering:
Hierarchical clustering techniques produce a sequence of partitions in
which each partition is nested into the next partition in the sequence,
creating a hierarchy of clusters from small to big or big to small. The
hierarchical clustering algorithms are of two types:
Agglomerative techniques
Divisive techniques

Among these techniques we focus mainly on PARTITIONAL ALGORITHMS.

PARTITIONAL ALGORITHMS

Partitional algorithms construct a partition of a database of N objects
into a set of k clusters. The construction involves determining the optimal
partition with respect to an objective function. There are approximately
k^N / k! ways of partitioning a set of N data points into k subsets.

A partitional clustering algorithm usually adopts an iterative optimization
paradigm. It starts with an initial partition and uses an iterative control
strategy: it tries swapping data points to see whether such a swap improves
the quality of the clustering. When no swap yields an improvement, the
algorithm has found a locally optimal partition. The quality of this
clustering is very sensitive to the initially selected partition.
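The size of this search space can be checked directly. The exact number of ways to partition N objects into k non-empty subsets is the Stirling number of the second kind, which k^N / k! approximates; the snippet below (an illustrative calculation, with N and k chosen arbitrarily) compares the two.

```python
from math import comb, factorial

def stirling2(n, k):
    """Exact number of partitions of n objects into k non-empty subsets
    (Stirling number of the second kind), via inclusion-exclusion."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // factorial(k)

n, k = 10, 3
exact = stirling2(n, k)            # 9330 distinct partitions
approx = k ** n // factorial(k)    # k^N / k! = 9841, the approximation above
```

Even at N = 10 and k = 3 the count runs into the thousands, which is why exhaustive search over partitions is infeasible and iterative optimization is used instead.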
There are two main categories of partitioning algorithms:
1. k-means algorithms, where each cluster is represented by the center of
gravity of the cluster.
2. k-medoid algorithms, where each cluster is represented by one of the
objects of the cluster, located near its center.
Most of the special clustering algorithms designed for data mining are
k-medoid algorithms.

k-Medoid Algorithms

PAM (Partitioning Around Medoids)

PAM uses a k-medoid method to identify the clusters. PAM selects k objects
arbitrarily from the data as medoids. In each step, a swap between a
selected object Oi and a non-selected object Oh is made, as long as such a
swap would result in an improvement of the quality of the clustering. To
calculate the effect of such a swap between Oi and Oh, a cost Cih is
computed, which is related to the quality of partitioning the non-selected
objects into the k clusters represented by the medoids.

It is first necessary to understand how the data objects are partitioned
when a set of k medoids is given.

PARTITIONING

If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs
to the cluster represented by Oi if d(Oj, Oi) = min over all medoids Oe of
d(Oj, Oe) (see Figure (b) below), where d(Oa, Ob) denotes the dissimilarity
(distance) between objects Oa and Ob (see Figure (a) below). The
dissimilarity matrix is known prior to the commencement of PAM. The quality
of the clustering is measured by the average dissimilarity between an
object and the medoid of the cluster to which the object belongs.
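Given a precomputed dissimilarity matrix d, this partitioning rule and the resulting quality measure can be sketched as follows (a minimal illustration; the function name, matrix, and index sets are made up for the example):

```python
def assign_to_medoids(d, medoids, non_selected):
    """Assign each non-selected object to its nearest medoid and return the
    clusters together with the average dissimilarity (clustering quality)."""
    clusters = {m: [m] for m in medoids}   # each medoid heads its own cluster
    total = 0.0
    for j in non_selected:
        nearest = min(medoids, key=lambda m: d[j][m])
        clusters[nearest].append(j)
        total += d[j][nearest]
    return clusters, total / len(non_selected)

# Toy symmetric dissimilarity matrix over four objects; 0 and 3 are medoids.
d = [[0, 1, 6, 4],
     [1, 0, 3, 5],
     [6, 3, 0, 2],
     [4, 5, 2, 0]]
clusters, quality = assign_to_medoids(d, [0, 3], [1, 2])
```

Here object 1 joins medoid 0 (distance 1) and object 2 joins medoid 3 (distance 2), giving an average dissimilarity of 1.5.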

ITERATIVE SELECTION OF MEDOIDS

Let us assume that O1, O2, ..., Ok are the k medoids selected at any stage,
and denote by C1, C2, ..., Ck the respective clusters. From the foregoing
discussion, for a non-selected object Oj and h in {1, 2, ..., k}, if Oj
belongs to Ch then min(1 <= i <= k) d(Oj, Oi) = d(Oj, Oh). Let us now
analyze the effect of swapping Oi and Oh.

In other words, let us compare the quality of clustering if we select the k
medoids as O1, O2, ..., Oi-1, Oh, Oi+1, ..., Ok, where Oh replaces Oi as
one of the medoids. Three types of changes can occur in the actual
clustering:

1. A non-selected object Oj such that Oj is in Ci before swapping and in Ch
after swapping. This case arises when the following conditions hold:
min over e of d(Oj, Oe) = d(Oj, Oi) before swapping, and
min over e != i of d(Oj, Oe) = d(Oj, Oh) after swapping.
Define the cost as Cjih = d(Oj, Oh) - d(Oj, Oi).

2. A non-selected object Oj such that Oj is in Ci before swapping and in
Cj' after swapping, with j' != h. This case arises when the following
conditions hold:
min over e of d(Oj, Oe) = d(Oj, Oi) before swapping, and
min over e of d(Oj, Oe) = d(Oj, Oj'), j' != h, after swapping.
Define the cost as Cjih = d(Oj, Oj') - d(Oj, Oi).

3. A non-selected object Oj such that Oj is in Cj' before swapping and in
Ch after swapping. This case arises when the following conditions hold:
min over e of d(Oj, Oe) = d(Oj, Oj') before swapping, and
min over e of d(Oj, Oe) = d(Oj, Oh) after swapping.
Define the cost as Cjih = d(Oj, Oh) - d(Oj, Oj').

Define the total cost of swapping Oi and Oh as Cih = sum over j of Cjih. If
Cih is negative, then the quality of clustering is improved by making Oh a
medoid in place of Oi. The process is repeated until no negative Cih can be
found. The algorithm can finally be stated as follows:
PAM Algorithm

Input: a database of objects D and the number of clusters k.

Select arbitrarily k representative objects. Mark these objects as selected
and mark the remaining objects as non-selected.

Repeat until no swap improves the clustering:
    Do for all selected objects Oi:
        Do for all non-selected objects Oh:
            Compute Cih
        End do
    End do
    Select imin, hmin such that C(imin, hmin) = min over i, h of Cih
    If C(imin, hmin) < 0 then
        mark O(imin) as non-selected and O(hmin) as selected
End repeat

Find the clusters C1, C2, ..., Ck
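The procedure above can be sketched in Python. This is an illustrative best-improvement version (the function name, cost helper, and toy data are made up for the example), not an optimized implementation: each pass evaluates every medoid/non-medoid swap and applies the one with the most negative cost, stopping when no swap improves the clustering.

```python
def pam(d, k):
    """PAM sketch: d is an N x N dissimilarity matrix, k the number of
    clusters. Returns the set of medoid indices at a local optimum."""
    n = len(d)
    medoids = set(range(k))  # arbitrary initial medoids

    def cost(meds):
        # Total dissimilarity of every object to its nearest medoid.
        return sum(min(d[j][m] for m in meds) for j in range(n))

    while True:
        best_swap, best_cost = None, cost(medoids)
        for i in medoids:
            for h in set(range(n)) - medoids:
                trial = (medoids - {i}) | {h}
                c = cost(trial)
                if c < best_cost:          # equivalent to a negative Cih
                    best_swap, best_cost = (i, h), c
        if best_swap is None:              # no negative-cost swap remains
            return medoids
        i, h = best_swap
        medoids = (medoids - {i}) | {h}
```

On a toy one-dimensional data set with points 0, 1, 2 and 10, 11, 12, the sketch converges from its arbitrary initial medoids to the two cluster centers (objects 1 and 4).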

CLARA

It can be observed that the major computational effort in PAM is to
determine the k medoids through iterative optimization. CLARA follows the
same principle but attempts to handle large data sets. Instead of finding
representative objects for the entire data set, CLARA draws a sample of the
data set, applies PAM to this sample, and finds the medoids of the sample.
If the sample is drawn in a sufficiently random way, the medoids of the
sample approximate the medoids of the entire data set. The steps of CLARA
are summarized below:

CLARA Algorithm

Input: a database of objects D and the number of clusters k.

Repeat a fixed number of times:
    Draw a sample S of D randomly from D.
    Call PAM(S, k) to get k medoids.
    Classify the entire data set D into C1, C2, ..., Ck.
    Calculate the quality of the clustering as the average dissimilarity.
End
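The steps above can be sketched in Python. The sketch is illustrative and self-contained: it embeds a compact PAM routine like the one in the previous section, and its parameter defaults are assumptions for the example (the original CLARA method is usually described with 5 samples of size 40 + 2k).

```python
import random

def pam(d, k):
    """Compact best-improvement PAM on dissimilarity matrix d."""
    n = len(d)
    meds = set(range(k))
    cost = lambda m: sum(min(d[j][x] for x in m) for j in range(n))
    while True:
        swaps = [((i, h), cost((meds - {i}) | {h}))
                 for i in meds for h in set(range(n)) - meds]
        (i, h), c = min(swaps, key=lambda s: s[1])
        if c >= cost(meds):                  # no improving swap left
            return meds
        meds = (meds - {i}) | {h}

def clara(d, k, samples=5, sample_size=None):
    """CLARA sketch: run PAM on random samples and keep the medoid set with
    the lowest average dissimilarity over the FULL data set."""
    n = len(d)
    sample_size = sample_size or min(n, 40 + 2 * k)
    best_meds, best_quality = None, float("inf")
    for _ in range(samples):
        s = random.sample(range(n), sample_size)
        sub = [[d[a][b] for b in s] for a in s]   # dissimilarities of the sample
        meds = {s[m] for m in pam(sub, k)}        # map sample indices back to D
        quality = sum(min(d[j][m] for m in meds) for j in range(n)) / n
        if quality < best_quality:
            best_meds, best_quality = meds, quality
    return best_meds, best_quality
```

Note that each candidate medoid set is scored against the entire data set, not just the sample: a sample that happens to miss a region of the data will produce medoids that score poorly overall and are discarded.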
CLARANS

CLARANS (Clustering Large Applications based on RANdomized Search) is
similar to PAM, but it applies randomized iterative optimization to
determine the medoids. It is easy to see that in PAM, at every iteration,
we examine k(N - k) swaps to determine the pair corresponding to the
minimum cost. CLARA, on the other hand, tries to examine fewer elements by
restricting its search to a smaller sample of the database: if the sample
size is S < N, it examines at most k(S - k) pairs at every iteration.

CLARANS does not restrict the search to any particular subset of objects,
nor does it search the entire data set. It randomly selects a few pairs for
swapping at the current state. Like PAM, CLARANS starts with a randomly
selected set of k medoids. It checks at most maxneighbour pairs for
swapping; if a pair with negative cost is found, it updates the medoid set
and continues. Otherwise, it records the current selection of medoids as a
local optimum and restarts with a new randomly selected medoid set to
search for another local optimum. CLARANS stops after numlocal local
optimal medoid sets have been determined and returns the best among these.

CLARANS Algorithm

Input: (D, k, maxneighbour, numlocal)

Set mincost to infinity and e = 1.
Do while (e <= numlocal)
    Select arbitrarily k representative objects. Mark these objects as
    selected and all other objects as non-selected. Call it current.
    Set j = 1.
    Do while (j <= maxneighbour)
        Consider randomly a pair (i, h) such that Oi is a selected object
        and Oh is a non-selected object.
        Calculate the cost Cih.
        If Cih is negative:
            update current: mark Oi non-selected, Oh selected, and set j = 1
        Else:
            increment j = j + 1
    End do
    Compare the cost of the clustering with mincost:
    If current_cost < mincost:
        mincost = current_cost
        best_node = current
    Increment e = e + 1
End do

Return best_node
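The algorithm can be sketched in Python as follows. This is an illustrative version; the parameter defaults are assumptions for the example (the published description of CLARANS suggests numlocal = 2 and maxneighbour proportional to k(N - k)).

```python
import random

def clarans(d, k, maxneighbour=20, numlocal=4):
    """CLARANS sketch: randomized search over medoid sets. Up to
    `maxneighbour` random swaps are tried at each node; when none improves,
    the node is a local optimum. The best of `numlocal` optima is returned."""
    n = len(d)
    cost = lambda meds: sum(min(d[j][m] for m in meds) for j in range(n))
    best_node, mincost = None, float("inf")
    for _ in range(numlocal):
        current = set(random.sample(range(n), k))   # random starting medoids
        current_cost = cost(current)
        j = 1
        while j <= maxneighbour:
            i = random.choice(sorted(current))                   # a medoid
            h = random.choice(sorted(set(range(n)) - current))   # a non-medoid
            trial = (current - {i}) | {h}
            c = cost(trial)
            if c < current_cost:      # negative swap cost: accept, reset j
                current, current_cost, j = trial, c, 1
            else:
                j += 1
        if current_cost < mincost:    # keep the best local optimum so far
            mincost, best_node = current_cost, current
    return best_node, mincost
```

Unlike PAM, each step costs only one clustering evaluation rather than k(N - k) of them, at the price of possibly settling for a local optimum; the numlocal restarts mitigate that risk.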

CONCLUSION

PAM is an iterative optimization that combines relocation of points between
prospective clusters with re-nominating points as potential medoids. The
guiding principle of the process is the effect on an objective function,
which is obviously a costly strategy. CLARA, on the other hand, tries to
examine fewer elements by restricting its search to a smaller sample of the
database. CLARANS does not restrict the search to any particular subset of
objects; it randomly selects a few pairs for swapping at the current state,
and so it is more efficient than the other two medoid-based algorithms.

REFERENCES

Data Mining Techniques by Arun K. Pujari.
Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber.
Database Mining: A Performance Perspective, by R. Agrawal and T. Imielinski,
IEEE, Dec. 1993.
IETE Journal of Research, Vol. 47, Jan. 2001.
IBM Research Paper, 20th VLDB Conference, Santiago, Chile.
