Data Mining & Cluster Computing
By
R. CHANDRA SEKHAR
(02121A0514)
V. PENCHALA PRAVEEN
(02121A0563)
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SREE VIDYANIKETHAN ENGINEERING COLLEGE
SREE SAINATH NAGAR, A. RANGAMPET, (A.P) 517102
INDEX
1. Abstract
2. Introduction
2.1 What is Data Mining?
2.2 Knowledge Discovery in Databases
2.3 Stages in Knowledge Discovery in Databases
3. Data Mining
3.1 Techniques
3.2 Applications
4. Clustering
4.1 Components of a Clustering Task
4.2 Stages in Clustering
5. Clustering Techniques
6. Partitional Algorithms
7. k-Medoid Algorithms
7.1 PAM (Partitioning Around Medoids)
7.1.1 Partitioning
7.1.2 Iterative Selection of Medoids
7.1.3 PAM Algorithm
7.2 CLARA
7.2.1 CLARA Algorithm
7.3 CLARANS
7.3.1 CLARANS Algorithm
8. Conclusion
ABSTRACT
With the explosive growth of data, the extraction of useful information from it has become a major task. Data Mining is the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. Among the areas of data mining, the problem of clustering data objects has received a great deal of attention.
Clustering segments a database into subsets or clusters. It is a useful technique for discovering the data distribution and patterns in the underlying data. The goal of clustering is to discover dense and sparse regions in a data set. We focus mainly on finding cluster groups in a database, for which we present three algorithms, namely PAM, CLARA and CLARANS.
INTRODUCTION
What is Data Mining?
The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster. The increasing use of electronic data-gathering devices, such as point-of-sale or remote-sensing devices, has contributed to this explosion of available data; a figure from the Red Brick Company illustrates this data explosion.
The analogy with the mining process is described as follows: Data Mining refers to "using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value because no direct use can be made of it; it is the hidden information in the data that is useful." Recorded events are nothing but data; giving structure to that data turns it into information. Day-to-day information grows vastly, so extracting useful information is difficult, and we therefore need a set of tools to analyse the data. This set of tools is called Data Mining.
Knowledge Discovery in Databases:
The term knowledge refers to relationships and patterns between data elements. The knowledge discovery process consists of six stages, of which Data Mining is the discovery stage.
The stages in KDD are:
1. Data Selection
2. Cleaning
3. Enrichment
4. Coding
5. Data Mining
6. Reporting
The fifth stage, Data Mining, is the phase of real discovery. Data Mining methodology states that, in the optimal situation, data mining is an ongoing process: one should continually work on the data, constantly identify new information needs and try to improve the data to make it match the goals better, so that the organization becomes a learning system. Since most of the phases need a great deal of creativity, such a process enables and encourages this creativity by refusing to impose any limit on possible activities.
[Figure: the KDD process, running from requirements through Data Selection, Data Cleaning, Enrichment and Coding, Data Mining and Reporting to interpretation and deployment.]
Techniques:
Association Rules, Classification Rules, Query Tools, Cluster Computing, Statistical Techniques, Decision Trees, Visualization, On-Line Analytical Processing (OLAP), k-Nearest Neighbours, Neural Networks, Genetic Algorithms.

Applications:
Risk Management, Fraud Management and Detection, Target Marketing, Cross Selling, Customer Retention, Market Basket Analysis, Market Segmentation, Forecasting, Improved Underwriting, Quality Control, Competitor Analysis.
These techniques of Data Mining support the applications listed above in a business environment.
CLUSTERING
Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of the other groups.
An example of clustering is depicted in the figures below: the input patterns are shown in Figure (a), and the desired clusters are shown in Figure (b), where points belonging to the same cluster are drawn in the same group. The variety of techniques for representing data, measuring proximity (similarity) between data elements and grouping data elements has produced a rich and often confusing assortment of clustering methods.
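As a concrete illustration of measuring proximity, the short Python sketch below builds a pairwise dissimilarity matrix from two-dimensional points using Euclidean distance. The point coordinates are invented for illustration, and the later medoid-based sketches assume such a matrix (or a distance function) is available.

    from math import dist  # Euclidean distance between two points (Python 3.8+)

    # A handful of invented 2-D points; two loose groups are visible by eye.
    points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),   # group around (1.5, 1.5)
              (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]   # group around (8.5, 8.5)

    # Pairwise dissimilarity matrix: entry [a][b] is the distance d(Oa, Ob).
    dissim = [[dist(p, q) for q in points] for p in points]

    for row in dissim:
        print(" ".join(f"{d:5.2f}" for d in row))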
CLUSTERING TECHNIQUES:
There are two main types of clustering techniques:
1. Partitional clustering:
The partitional clustering techniques construct a partition of the database into a predefined number of clusters. They attempt to determine k partitions that optimize a certain criterion function. The partitional clustering algorithms are of two types:
k-means algorithms
k-medoid algorithms
2. Hierarchical clustering:
The hierarchical clustering techniques produce a sequence of partitions in which each partition is nested into the next partition in the sequence, creating a hierarchy of clusters from small to big or from big to small. The hierarchical clustering algorithms are of two types:
Agglomerative technique
Divisive technique
Among these techniques we focus mainly on PARTITIONAL ALGORITHMS.
PARTITIONAL ALGORITHMS
Partitional algorithms construct a partition of a database of N objects into a set of k clusters. The construction involves determining the optimal partition with respect to an objective function. There are approximately k^N / k! ways of partitioning a set of N data points into k subsets.
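To get a feel for how quickly this number grows, and why exhaustive enumeration of partitions is hopeless, the small Python check below evaluates the k^N / k! estimate; the values of N and k are only illustrative.

    from math import factorial

    N, k = 100, 5                                  # a small data set by data-mining standards
    approx_partitions = k**N / factorial(k)        # the k^N / k! estimate
    print(f"about {approx_partitions:.2e} candidate partitions")  # roughly 6.6e+67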
Partitional clustering algorithms usually adopt an iterative optimization paradigm. The algorithm starts with an initial partition and uses an iterative control strategy: it tries swapping data points to see whether such a swap improves the quality of the clustering, and when no swap yields an improvement it has reached a locally optimal partition. The quality of this clustering is very sensitive to the initially selected partition.
There are two main categories of partitioning algorithms:
1. k-means algorithms, where each cluster is represented by the centre of gravity of the cluster.
2. k-medoid algorithms, where each cluster is represented by one of the objects of the cluster located near its centre.
Most of the specialized clustering algorithms designed for data mining are k-medoid algorithms.
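The difference between the two kinds of representative can be seen in a few lines of Python: the centroid used by k-means is the coordinate-wise mean and need not be a data point, while the medoid is the actual object whose total dissimilarity to the rest of the cluster is smallest. The sample points are invented for illustration.

    from math import dist

    cluster = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (9.0, 9.0)]  # note the outlier

    # k-means style representative: centre of gravity (need not be a data point).
    centroid = tuple(sum(c) / len(cluster) for c in zip(*cluster))

    # k-medoid style representative: the object minimising total distance to the others.
    medoid = min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

    print("centroid:", centroid)   # pulled towards the outlier
    print("medoid:  ", medoid)     # stays on an actual, central object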
k-Medoid Algorithms
PAM (Partitioning Around Medoids)
PAM (Partitioning Around Medoids) uses a k-medoid method to identify the clusters. PAM selects k objects arbitrarily from the data as medoids. In each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. To calculate the effect of such a swap between Oi and Oh, a cost Cih is computed, which is related to the quality of partitioning the non-selected objects into the k clusters represented by the medoids.
It is necessary first to understand how the data objects are partitioned when a set of k medoids is given.
PARTITIONING
If Oj is a non-selected object and Oi is a medoid, we say that Oj belongs to the cluster represented by Oi if d(Oi, Oj) = min_Oe d(Oj, Oe) (see Figure (b) below), where the minimum is taken over all medoids Oe and d(Oa, Oh) gives the dissimilarity (distance) between objects Oa and Oh (see Figure (a) below). The dissimilarity matrix is known prior to the commencement of PAM. The quality of the clustering is measured by the average dissimilarity between an object and the medoid of the cluster to which the object belongs.
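A minimal Python sketch of this partitioning rule follows, assuming the dissimilarity matrix is already available as a nested list; the helper name assign_to_medoids is our own. Every non-selected object is attached to the medoid it is least dissimilar to, and the quality of the clustering is the average of those dissimilarities.

    def assign_to_medoids(dissim, medoids):
        """Attach each non-selected object to its nearest medoid.

        dissim  -- precomputed dissimilarity matrix, dissim[a][b] = d(Oa, Ob)
        medoids -- indices of the k selected objects (the medoids)
        Returns (assignment dict, average dissimilarity of the clustering).
        """
        assignment, total = {}, 0.0
        non_selected = [j for j in range(len(dissim)) if j not in medoids]
        for j in non_selected:
            nearest = min(medoids, key=lambda i: dissim[i][j])
            assignment[j] = nearest
            total += dissim[nearest][j]
        quality = total / len(non_selected) if non_selected else 0.0
        return assignment, quality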
PAM ALGORITHM:
1. Select k objects arbitrarily from the data as the initial medoids.
2. For each pair of a selected object Oi and a non-selected object Oh, calculate the swap cost Cih, and find the pair (imin, hmin) with the minimum cost.
3. If Cimin,hmin < 0
   Then mark Oimin as non-selected and Ohmin as selected
   and repeat from step 2.
4. Otherwise, assign each non-selected object to the most similar medoid and stop.
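Putting these pieces together, the following is a compact, illustrative Python version of the PAM loop, not the reference implementation: PAM proper computes the swap costs incrementally, whereas this sketch simply re-evaluates the total cost of every candidate swap between a selected object Oi and a non-selected object Oh and performs the best swap as long as its cost Cih is negative.

    import random

    def clustering_cost(dissim, medoids):
        # Total dissimilarity of every object to its nearest medoid.
        return sum(min(dissim[i][j] for i in medoids) for j in range(len(dissim)))

    def pam(dissim, k, seed=0):
        rng = random.Random(seed)
        n = len(dissim)
        medoids = set(rng.sample(range(n), k))          # step 1: arbitrary medoids
        best_cost = clustering_cost(dissim, medoids)
        while True:
            best_swap, best_delta = None, 0.0
            for i in medoids:                           # selected object Oi
                for h in set(range(n)) - medoids:       # non-selected object Oh
                    candidate = (medoids - {i}) | {h}
                    delta = clustering_cost(dissim, candidate) - best_cost  # cost Cih
                    if delta < best_delta:
                        best_swap, best_delta = (i, h), delta
            if best_swap is None:                       # no swap improves the clustering
                return medoids
            i, h = best_swap                            # perform the best negative-cost swap
            medoids = (medoids - {i}) | {h}
            best_cost += best_delta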
CLARA
It can be observed that the major computational effort in PAM is to determine the k medoids through iterative optimization. CLARA, though it follows the same principle, attempts to handle large data sets. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM to this sample and finds the medoids of the sample. If the sample were drawn in a sufficiently random way, the medoids of the sample would approximate the medoids of the entire data set. The steps of CLARA are summarized below:
ALGORITHM:
Input: a database D of objects and the number of clusters k.
Repeat a fixed number of times:
1. Draw a sample S ⊆ D randomly from D.
2. Apply PAM to S to find the k medoids of the sample.
3. Assign every object in D to the nearest of these medoids.
4. Calculate the average dissimilarity of the resulting clustering; if it is smaller than the current minimum, retain the k medoids found in this iteration as the best set so far.
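A hedged Python sketch of CLARA's sampling idea is given below. To keep it self-contained, the medoids of each small sample are found by exhaustive search over all k-subsets of the sample rather than by calling PAM; the quality of each candidate set is judged against the entire data set, and the best medoids seen are retained. The sample size and number of repetitions are illustrative choices, and the function names are our own.

    import random
    from itertools import combinations
    from math import dist

    def total_cost(points, medoids):
        # Sum of distances from every object in the data set to its nearest medoid.
        return sum(min(dist(p, m) for m in medoids) for p in points)

    def clara(points, k, sample_size=20, n_samples=5, seed=0):
        rng = random.Random(seed)
        best_medoids, best_cost = None, float("inf")
        for _ in range(n_samples):
            sample = rng.sample(points, min(sample_size, len(points)))
            # Stand-in for PAM on the sample: exhaustively pick the best k medoids.
            medoids = min(combinations(sample, k),
                          key=lambda ms: total_cost(sample, ms))
            # Judge the sample's medoids against the whole data set.
            cost = total_cost(points, medoids)
            if cost < best_cost:
                best_medoids, best_cost = list(medoids), cost
        return best_medoids, best_cost

    # Example use on invented data:
    data = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(200)]
    print(clara(data, k=3))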
CLARANS
CLARANS does not confine its search to a fixed sample of the data; at each step it examines a randomly chosen swap between a selected object and a non-selected object. Its behaviour is controlled by two parameters: numlocal, the number of local searches performed, and maxneighbour, the maximum number of neighbours examined in one local search.

CLARANS ALGORITHM:
Set e = 1.
Do while (e ≤ numlocal)
    Select k objects arbitrarily as the current set of medoids.
    Set j = 1.
    Do while (j ≤ maxneighbour)
        Consider randomly a pair (i, h) such that Oi is a selected object and Oh is a non-selected object.
        Calculate the cost Cih.
        If Cih is negative
            Update the current clustering: mark Oi as non-selected, Oh as selected, and set j = 1.
        Else
            Increment j = j + 1.
    End do.
    Compare the cost of the current clustering with the best found so far and retain the better one.
    Increment e = e + 1.
End do.
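The Python sketch below mirrors this pseudocode: for each of numlocal restarts it begins from an arbitrary set of k medoids and repeatedly examines a randomly chosen swap (Oi, Oh); a negative-cost swap is applied immediately and the neighbour counter is reset, while maxneighbour consecutive non-improving trials end the local search. The parameter defaults are illustrative and the cost is recomputed from scratch for clarity.

    import random
    from math import dist

    def total_cost(points, medoids):
        # Sum of distances from every object to its nearest medoid (given as indices).
        return sum(min(dist(p, points[i]) for i in medoids) for p in points)

    def clarans(points, k, numlocal=2, maxneighbour=50, seed=0):
        rng = random.Random(seed)
        n = len(points)
        best_medoids, best_cost = None, float("inf")
        for _ in range(numlocal):                       # outer loop: e = 1 .. numlocal
            current = set(rng.sample(range(n), k))      # arbitrary current medoids
            current_cost = total_cost(points, current)
            j = 1
            while j <= maxneighbour:                    # inner loop over random neighbours
                i = rng.choice(sorted(current))                            # selected Oi
                h = rng.choice([x for x in range(n) if x not in current])  # non-selected Oh
                candidate = (current - {i}) | {h}
                c_ih = total_cost(points, candidate) - current_cost        # swap cost Cih
                if c_ih < 0:                            # improvement: move to the neighbour
                    current, current_cost = candidate, current_cost + c_ih
                    j = 1
                else:
                    j += 1
            if current_cost < best_cost:                # keep the best local optimum found
                best_medoids, best_cost = current, current_cost
        return sorted(best_medoids), best_cost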
CONCLUSION
PAM is an iterative optimization that combines relocation of points between prospective clusters with re-nominating the points as potential medoids. The guiding principle for the process is the effect on an objective function, which, obviously, is a costly strategy. On the other hand, CLARA tries to examine fewer elements by restricting its search to a smaller sample of the database. CLARANS does not restrict the search to any particular subset of objects; it randomly selects a few pairs for swapping at the current state, and so it is more efficient than the other two medoid-based algorithms.