
International Journal of Technical Research and Applications, e-ISSN: 2320-8163, www.ijtra.com, Volume 3, Issue 3 (May-June 2015), PP. 169-172

AN IMPROVED TECHNIQUE FOR DOCUMENT CLUSTERING
Priti B. Kudal1, Prof. Manisha Naoghare2
1Student, Master of Engineering, 2Assistant Professor,
Department of Computer Engineering,
Sir Visvesvaraya Institute of Technology, Chincholi, Sinnar.
1priti_1619@rediffmail.com, 2manisha.naoghare@gmail.com
Abstract: Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how does one decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing gathered data or categorizing the results returned by search engines in reply to a user's query. In this paper, we provide a comprehensive survey of document clustering.

Keywords: Data Mining, Clustering, Classification, Similarity Measure, Term Frequency.

I. INTRODUCTION
Document clustering is also applicable in producing a hierarchical grouping of documents (Ward 1963). In order to search and retrieve information efficiently in Document Management Systems (DMS), a metadata set with adequate detail should be created for the documents. But just one metadata set is not enough for the whole document management system, because various document types need different attributes to be distinguished appropriately.
So clustering of documents is an automatic grouping of text documents into clusters such that documents within a cluster have high resemblance to one another, but are different from documents in other clusters. Hierarchical document clustering (Murtagh 1984) organizes clusters into a tree or a hierarchy that benefits browsing. Information Retrieval (IR) (Baeza 1992) is the field of computer science that focuses on the processing of documents in such a way that a document can be quickly retrieved based on keywords specified in a user's query. IR technology is the foundation of web-based search engines and plays a key role in biomedical research, as it is the basis of software that aids literature search.
II. LITERATURE SURVEY
Document clustering is the process of categorizing text documents into systematic clusters or groups, such that the documents in the same cluster are similar whereas the documents in other clusters are dissimilar. It is one of the vital processes in text mining. Liping (2010) emphasized that the expansion of the internet and of computational processes has paved the way for various clustering techniques. Text mining especially has gained a lot of importance, and it demands various tasks such as the production of granular taxonomies, document summarization, etc., for deriving higher-quality information from text.
Likas et al. (2003) proposed the global K-means clustering technique, which creates initial centers by recursively dividing the data space into disjoint subspaces using the K-dimensional (k-d) tree approach. The cutting hyperplane used in this approach is the plane perpendicular to the maximum-variance axis derived by Principal Component Analysis (PCA). Partitioning is carried out until each of the leaf nodes possesses fewer than a predefined number of data instances or the predefined number of buckets has been generated. The initial centers for K-means are the centroids of the data present in the final buckets. Shehroz Khan and Amir Ahmad (2004) proposed iterative clustering techniques to calculate initial cluster centers for K-means. This process is feasible for clustering techniques for continuous data.
Agrawal et al. (2005) described data mining applications and their various requirements on clustering techniques. The main requirements considered are the ability to identify clusters embedded in subspaces of high-dimensional data, scalability, comprehensibility of results by end-users, and robustness to unpredictable data distributions.
The main limitation of the K-means approach is that it may generate empty clusters depending on the initial center vectors. This drawback does not cause any significant problem for a single static execution of the K-means algorithm, and it can be overcome by executing the algorithm a number of times. However, in a few applications, the empty-cluster issue leads to erratic behavior of the system and affects the overall performance. Malay Pakhira et al. (2009) proposed a modified version of the K-means algorithm that effectively eradicates this empty-cluster problem. In the experiments done in this regard, this algorithm showed better performance than the traditional methods.
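As a hedged illustration of the empty-cluster issue (and of one common remedy, not Pakhira's specific algorithm), the snippet below re-seeds any cluster that ends an iteration empty with the point that currently fits its own cluster worst; all names are illustrative.

```python
import numpy as np

def reseed_empty_clusters(X, labels, centroids):
    """Re-seed any empty cluster with the point that fits its current cluster worst."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).copy()
    centroids = np.asarray(centroids, dtype=float).copy()
    for c in range(len(centroids)):
        if not np.any(labels == c):                    # cluster c ended the iteration empty
            dists = np.linalg.norm(X - centroids[labels], axis=1)
            farthest = int(np.argmax(dists))           # point farthest from its own centroid
            centroids[c] = X[farthest]
            labels[farthest] = c
    return labels, centroids
```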
Uncertain heterogeneous data streams (Charu Aggarwal et al. 2003) are seen in most applications, but the clustering quality of the existing approaches for clustering heterogeneous data streams with uncertainty is not satisfactory. Guo-Yan Huang et al. (2010) posited an approach for clustering heterogeneous data streams with uncertainty. A frequency histogram used in H-UCF helps to trace the characteristic categorical statistics. Initially creating n clusters with a K-prototype algorithm, the new approach proves to be more useful than UMicro in regard to clustering quality.
Alam et al. (2010) designed a novel clustering algorithm
by blending partitional and hierarchical clustering called
HPSO. It utilized the swarm intelligence of ants in a decentralized environment. This algorithm proved to be very effective as it performed clustering in a hierarchical manner.
Shin-Jye Lee et al. (2010) suggested a clustering-based method to identify the fuzzy system. To initiate the task, it tried to present a modular approach based on a hybrid clustering technique. Next, finding the number and location of clusters seemed the primary concern for evolving such a model. So, taking input, output, generalization and specialization into account, an HCA has been designed. This three-part input-output clustering algorithm adopts several clustering characteristics simultaneously to identify the problem.
Only a few researchers have focused attention on partitioning categorical data in an incremental mode. Designing an incremental clustering for categorical data is a vital issue. Li Taoying et al. (2010) lent support to an incremental clustering for categorical data using clustering ensembles. They initially reduced redundant attributes if required, and then made use of true values of different attributes to form clustering memberships.
Crescenzi et al. (2004) cited an approach that automatically extracts data from large data-intensive web sites. The data grabber investigates a large web site and infers a scheme for it, describing it as a directed graph with nodes describing classes of structurally similar pages and arcs representing links between these pages. After locating the classes of interest, a library of wrappers can be created, one per class, with the help of an external wrapper generator, and in this way suitable data can be extracted.
Miha Grcar et al. (2008) mulled over the lack of software mining techniques, software mining being the process of extracting knowledge out of source code. They presented a software mining mission with an integration of text mining and link analysis techniques. This technique is concerned with the inter-links between instances. Retrieval and knowledge-based approaches are the two main tasks used in constructing a tool for software components. An ontology-learning framework named LATINO was developed by Grcar et al. (2006). LATINO, an open-source, general-purpose data mining platform, offers text mining, link analysis, machine learning, etc.
Similarity-based approaches and model-based approaches (Meila and Heckerman 2001) are the two major categories of clustering approaches, and these have been described by Pallav Roxy and Durga Toshniwal (2009). The former, capable of maximizing average similarities within clusters and minimizing them among clusters, is a pairwise similarity clustering approach. The latter tries to generate models from the documents, each model representing one document group in particular.
Document clustering is becoming more and more important with the abundance of text documents available through the World Wide Web and corporate document management systems. But there are still some major drawbacks in the existing text clustering techniques that greatly affect their practical applicability. The drawbacks in the existing clustering approaches are listed below:
Text clustering that yields a clear-cut output has got to be the most favorable. However, documents can be regarded differently by people with different needs vis-à-vis the clustering of texts. For example, a businessman looks at business documents not in the same way as a technologist sees them (Macskassy et al. 1998). So clustering tasks depend on intrinsic parameters that make way for a diversity of views.
Text clustering is a clustering task in a high-dimensional space, where each word is seen as an important attribute for a text. Empirical and mathematical analyses have revealed that clustering in high-dimensional spaces is very complex, as every data point is likely to have nearly the same distance from all the other data points (Beyer et al. 1999). (A small numerical illustration of this effect is given after this list.)
Text clustering is often useless unless it is integrated with the reason why particular texts are grouped into a particular cluster. That is, the output preferred from clustering in practical settings is the explanation of why a particular cluster result was created, rather than the result itself. One usual technique for producing explanations is the learning of rules based on the cluster results. But this technique suffers from the high number of features chosen for computing clusters.
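The distance-concentration effect cited above (Beyer et al. 1999) can be observed in a few lines; the snippet below is only a quick numerical illustration comparing the relative contrast of pairwise distances for random points in low and high dimensions, not material from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def distance_spread(n_points=200, dims=(2, 1000), seed=0):
    """Relative contrast (max-min)/min of pairwise distances; it shrinks as dimension grows."""
    rng = np.random.default_rng(seed)
    for d in dims:
        X = rng.random((n_points, d))
        dist = pdist(X)                      # all pairwise Euclidean distances
        contrast = (dist.max() - dist.min()) / dist.min()
        print(f"d={d}: relative contrast = {contrast:.2f}")

# distance_spread()  # contrast is large for d=2 and close to 0 for d=1000
```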
III. EXISTING SYSTEM
Clustering is sometimes erroneously referred to as automatic classification; however, this is inaccurate, since the clusters found are not known prior to processing, whereas in the case of classification the classes are pre-defined. In clustering, it is the distribution and the nature of the data that determine cluster membership, in opposition to classification, where the classifier learns the association between objects and classes from a so-called training set, i.e. a set of data correctly labeled by hand, and then replicates the learnt behavior on unlabeled data.

Drawbacks of Existing System
1) K-Medoid Clustering Algorithm
Weaknesses:
a) Relatively more costly; the complexity is O(i k (n-k)^2), where i is the total number of iterations, k is the total number of clusters, and n is the total number of objects.
b) Relatively less efficient.
c) Need to specify k, the total number of clusters, in advance.
d) Result and total run time depend upon the initial partition.
2) Hierarchical Clustering Algorithm
Weaknesses:
a) Depends on the scale of data.
b) Computationally complex for large datasets.
c) Different methods sometimes lead to very different dendrograms.

IV. PROPOSED SYSTEM
A. Architecture of Proposed System
The outline of the proposed system is as follows:
1) Preprocessing Module
Before running clustering algorithms on the text datasets, I performed some pre-processing steps. In particular, stop words (prepositions, pronouns, articles, and irrelevant document metadata) have been removed. Also, the Snowball stemming algorithm for Portuguese words has been used. Then, I adopted a traditional statistical approach for text mining, in which documents are represented in a vector space model. In this model, each document is represented by a vector containing the frequencies of occurrences of words, which are defined as delimited alphabetic strings whose number of characters is between 4 and 25. I also used a dimensionality reduction technique known as Term Variance (TV) that can increase both the effectiveness and efficiency of clustering algorithms. TV selects a number of attributes (in our case 100 words) that have the greatest variances over the documents. In order to compute distances between documents, two measures have been used, namely cosine-based distance and Levenshtein-based distance. The latter has been used to calculate distances between file (document) names only.
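A minimal sketch of this module is given below, assuming plain term-frequency vectors and NumPy; the tokenization rule, the 4-25 character limit and the 100-term cap follow the description above, while the function and variable names are illustrative only (stemming is omitted for brevity).

```python
import re
from collections import Counter

import numpy as np

def preprocess(docs, stop_words, n_terms=100):
    """Tokenize, drop stop words, build term-frequency vectors, keep high-variance terms."""
    token_re = re.compile(r"[A-Za-z]{4,25}")      # alphabetic strings of 4 to 25 characters
    tokenized = [
        [t.lower() for t in token_re.findall(d) if t.lower() not in stop_words]
        for d in docs
    ]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}

    tf = np.zeros((len(docs), len(vocab)))        # document-term frequency matrix
    for i, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            tf[i, index[term]] = count

    variances = tf.var(axis=0)                    # Term Variance (TV) of each column
    keep = np.argsort(variances)[::-1][:n_terms]  # columns with the greatest variance
    return tf[:, keep], [vocab[j] for j in keep]

# Usage: vectors, terms = preprocess(["Clustering groups documents", "Documents form groups"], {"the"})
```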
2) Calculating the number of clusters
In order to estimate the number of clusters, a widely used approach consists of getting a set of data partitions with different numbers of clusters and then selecting that particular partition that provides the best result according to a specific quality criterion (e.g., a relative validity index). Such a set of partitions may result directly from a hierarchical clustering dendrogram or, alternatively, from multiple runs of a partitional algorithm (e.g., K-means) starting from different numbers and initial positions of the cluster prototypes.
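As one concrete reading of this strategy, the sketch below runs K-means for several candidate values of k and keeps the partition with the best silhouette coefficient; taking the silhouette as the relative validity index is an assumption (the outlier-removal step below also relies on it), and scikit-learn is used purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_values=range(2, 11), seed=0):
    """Run K-means for each candidate k and keep the partition with the best silhouette."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)      # relative validity index
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Usage: k, labels = choose_k(np.asarray(vectors))
```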
3) Clustering techniques
The clustering algorithms adopted in this study (the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster-ensemble-based algorithm known as CSPA) are popular in the machine learning and data mining fields, and therefore they have been used here. Nevertheless, some of my choices regarding their use deserve further comments. For instance, K-medoids is similar to K-means. However, instead of computing centroids, it uses medoids, which are the representative objects of the clusters. This property makes it particularly interesting for applications in which (i) centroids cannot be computed; and (ii) distances between pairs of objects are available, as for computing dissimilarities between names of documents with the Levenshtein distance.
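For the Levenshtein-based dissimilarity between document names mentioned here, a standard dynamic-programming implementation suffices; the snippet below is textbook code, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning string a into string b."""
    prev = list(range(len(b) + 1))                # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Usage: levenshtein("report_2014.txt", "report_2015.txt") returns 1
```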
4) Removing Outliers
I assess a simple approach to remove outliers. This approach makes recursive use of the silhouette. Fundamentally, if the best partition chosen by the silhouette has singletons (i.e., clusters formed by a single object only), these are removed. Then, the clustering process is repeated over and over again until a partition without singletons is found. At the end of the process, all singletons are incorporated into the resulting data partition (for evaluation purposes) as single clusters.
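A hedged sketch of this recursive procedure is shown below; it assumes the choose_k helper from the cluster-count sketch above and simply peels off singleton clusters until none remain, reattaching them afterwards as one-object clusters.

```python
import numpy as np

def remove_outliers(X, choose_k):
    """Repeat clustering, peeling off singleton clusters, until none remain."""
    X = np.asarray(X, dtype=float)
    active = np.arange(len(X))                    # indices of documents still being clustered
    singletons = []
    while True:
        _, labels = choose_k(X[active])           # best partition according to the silhouette
        counts = np.bincount(labels)
        lone = np.isin(labels, np.where(counts == 1)[0])   # members of singleton clusters
        if not lone.any():
            break
        singletons.extend(active[lone].tolist())
        active = active[~lone]
    return {"clustered": active, "labels": labels, "singletons": singletons}
```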
V. IMPLEMENTATION
A. Algorithm for Stop Word Removal
A typical method to remove stop words is to compare each term with a compilation of known stop words.
Input: A document database D and a list of stop words L
D = {d1, d2, d3, ..., dk}, where 1 <= k <= i
tij is the jth term in the ith document
Output: All valid stemmed text terms in D
for (all di in D) do
  for (each term tij in di) do
    Extract tij from di
    If (tij is in list L)
      Remove tij from di
  End for
End for
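The pseudocode above translates directly into Python; the sketch below also applies the Snowball stemmer mentioned in the preprocessing module so that the output matches the promised "valid stemmed terms". Using NLTK's SnowballStemmer is an assumption about tooling, not something the paper specifies.

```python
from nltk.stem.snowball import SnowballStemmer

def remove_stop_words(documents, stop_words, language="english"):
    """For each tokenized document, drop stop-word terms and stem the remainder."""
    stemmer = SnowballStemmer(language)            # e.g. "english" or "portuguese"
    cleaned = []
    for terms in documents:                        # documents: list of term lists
        kept = [stemmer.stem(t) for t in terms if t.lower() not in stop_words]
        cleaned.append(kept)
    return cleaned

# Usage:
# remove_stop_words([["Document", "clustering", "groups", "the", "documents"]], {"the"})
```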
B. Calculating Similarity between two documents
For i := 0 to N (total documents)
  For j := 0 to N (total documents)
    Simvalue := dot(doc[i], doc[j]) / sqrt(dot(doc[i], doc[i]) * dot(doc[j], doc[j]))
    Add Simvalue to the similarity matrix
  Next
Next
where N is the total number of documents and doc[i], for i = 1, 2, ..., N, are the document vectors.
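Assuming the intended measure is cosine similarity over the term-frequency vectors (the preprocessing module names a cosine-based distance), the double loop above can be vectorized as in the following sketch.

```python
import numpy as np

def cosine_similarity_matrix(doc_vectors):
    """All-pairs cosine similarity for an (n_docs x n_terms) matrix of TF vectors."""
    X = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # guard against empty documents
    unit = X / norms
    return unit @ unit.T                     # entry [i, j] = cos(doc_i, doc_j)

# Usage: sim = cosine_similarity_matrix(vectors); dist = 1.0 - sim
```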
C. Clustering Technique
K-means and the improved method are used.
Steps of the K-means method:
1) Initialization: In this first step the data set, the number of clusters and the centroid for each cluster are defined.
2) Classification: The distance from each centroid is calculated for every data point, and each data point is assigned to the cluster whose centroid is at the minimum distance.
3) Centroid Recalculation: For the clusters generated previously, the centroid of each cluster is recalculated.
4) Convergence Condition: Some convergence conditions are given below:
   - Stopping when reaching a given or defined number of iterations.
   - Stopping when there is no exchange of data points between the clusters.
   - Stopping when a threshold value is achieved.
5) If none of the above conditions is satisfied, go to step 2 and repeat the whole process until the given conditions are satisfied.
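A compact NumPy rendering of steps 1-5 is sketched below: random initial centroids, assignment to the nearest centroid, centroid recalculation, and stopping when no point changes cluster or an iteration cap is hit. It is an illustrative reading of the steps, not the authors' implementation.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: assign points to the nearest centroid, recompute centroids, repeat."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: initialization
    labels = np.full(len(X), -1)
    for _ in range(max_iter):                                  # step 4: iteration cap
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                      # step 2: classification
        if np.array_equal(new_labels, labels):                 # step 4: no point moved
            break
        labels = new_labels
        for c in range(k):                                     # step 3: centroid recalculation
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids
```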
Steps of the improved method:
Input: D = {d1, d2, d3, ..., di, ..., dn} // set of documents
       di = {x1, x2, x3, ..., xi, ..., xm}; k // number of desired clusters
Output: A set of k clusters.
1) Calculate the distance of each document (data point) from the origin.
2) Arrange the distances (obtained in step 1) in ascending order.
3) Split the sorted list into k equal-sized subsets. The middle point of each subset is taken as the centroid of that set.
4) Repeat this step for all data points: the distance between each data point and all the centroids is calculated, and the data point is assigned to the closest cluster.
5) In this step, the centroids of all the clusters are recalculated.
6) Now, for all data points, the distance between each data point and all the centroids is calculated. If this distance is less than or equal to the present nearest distance, the data point stays in the same cluster; else it is shifted to the nearest new cluster.
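Reading the improvement as a deterministic seeding of K-means (sort the documents by their distance from the origin, cut the sorted list into k equal parts, and take the middle document of each part as an initial centroid), a sketch follows; the function name and the reuse of the K-means loop above are assumptions for illustration.

```python
import numpy as np

def improved_initial_centroids(X, k):
    """Seed centroids from k equal slices of the points sorted by distance to the origin."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(np.linalg.norm(X, axis=1))        # steps 1-2: sort by distance
    slices = np.array_split(order, k)                    # step 3: k (nearly) equal subsets
    centroids = np.array([X[s[len(s) // 2]] for s in slices])   # middle point of each subset
    return centroids

# Steps 4-6 then proceed as in the K-means loop above, starting from these centroids.
```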
VI. RESULT

VII. CONCLUSION
As clustering plays a very vital role in various applications, much research is still being done. The upcoming innovations are mainly due to the properties and the characteristics of existing methods. These existing approaches form the basis for the various innovations in the field of clustering. From the existing clustering techniques, it is clearly observed that the clustering techniques provide significant results and performance. Hence, this research concentrates mainly on clustering for better performance.

VIII. ACKNOWLEDGEMENT
I would like to express my profound gratitude and deep regard to my Project Guide, Prof. M. M. Naoghare, for her exemplary counsel, valuable feedback and constant encouragement throughout the duration of the project. Her suggestions were of immense help throughout my project work. Working under her was an extremely knowledgeable experience for me.

REFERENCES
[1] Priti B. Kudal and Prof. M. M. Naoghare, "A Review of Modern Document Clustering Techniques," International Journal of Science & Research (IJSR), Vol. 3, Issue 10, October 2014.
[2] Priti B. Kudal and Prof. Manisha Naoghare, "An Improved Hierarchical Technique for Document Clustering," International Journal of Science & Research (IJSR), Vol. 4, Issue 4, April 2015.
[3] Agrawal, Rakesh, Gehrke, Johannes, Gunopulos, Dimitrios and Raghavan, Prabhakar, "Automatic subspace clustering of high dimensional data," Data Mining and Knowledge Discovery (Springer Netherlands), Vol. 11, pp. 5-33, DOI:10.1007/s10618-005-1396-1, 2005.
[4] Alam, S., Dobbie, G., Riddle, P. and Naeem, M. A., "Particle Swarm Optimization Based Hierarchical Agglomerative Clustering," IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 2, pp. 64-68, 2010.
[5] Baeza-Yates, R. A., "Introduction to Data Structures and Algorithms Related to Information Retrieval," in Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds., Prentice-Hall, Inc., Upper Saddle River, New Jersey, pp. 13-27, 1992.
[6] Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu, "A Framework for Clustering Evolving Data Streams," Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), pp. 81-92, 2003.
[7] Crescenzi, Valter, Giansalvatore Mecca, Paolo Merialdo and Paolo Missier, "An Automatic Data Grabber for Large Web Sites," VLDB, pp. 1321-1324, 2004.
[8] Grcar, M., Mladenic, D., Grobelnik, M., Fortuna, B. and Brank, J., "Ontology Learning Implementation," Project report IST-2004-026460 TAO, WP 2, D2.2, 2006.
[9] Guo-Yan Huang, Da-Peng Liang, Chang-Zhen Hu and Jia-Dong Ren, "An algorithm for clustering heterogeneous data streams with uncertainty," 2010 International Conference on Machine Learning and Cybernetics (ICMLC), Vol. 4, pp. 2059-2064, 2010.
[10] Li Taoying, Chen Yan, Qu Lili and Mu Xiangwei, "Incremental clustering for categorical data using clustering ensemble," 29th Chinese Control Conference (CCC), pp. 2519-2524, 2010.
[11] Likas, A., Vlassis, N. and Verbeek, J. J., "The Global k-means Clustering Algorithm," Pattern Recognition, Vol. 36, No. 2, pp. 451-461, 2003.
[12] Lijuan Jiao and Liping Feng, "Text Classification Based on Ant Colony Optimization," Third International Conference on Information and Computing (ICIC), Vol. 3, pp. 229-232, 2010.
[13] Macskassy, S. A., Banerjee, A., Davison, B. D. and Hirsh, H., "Human Performance on Clustering Web Pages: A Preliminary Study," in Proc. of KDD-1998, New York, USA, pp. 264-268, Menlo Park, CA, USA, 1998.
[14] Malay K. Pakhira, "A Modified k-means Algorithm to Avoid Empty Clusters," International Journal of Recent Trends in Engineering, Vol. 1, No. 1, pp. 220-226, 2009.
[15] Meila, M. and Heckerman, D., "An Experimental Comparison of Model-Based Clustering Methods," Machine Learning, Kluwer Academic Publishers, Vol. 42, pp. 9-29, 2001.
[16] Miha Grcar, Marko Grobelnik and Dunja Mladenic, "Using Text Mining and Link Analysis for Software Mining," Lecture Notes in Computer Science, Vol. 4944, pp. 1-12, 2008.
[17] Murtagh, F., "A Survey of Recent Advances in Hierarchical Clustering Algorithms Which Use Cluster Centers," The Computer Journal, Vol. 26, pp. 354-359, 1984.
[18] Pallav Roxy and Durga Toshniwal, "Clustering Unstructured Text Documents Using Fading Function," International Journal of Information and Mathematical Sciences, Vol. 5, No. 3, pp. 149-156, 2009.
[19] Shehroz S. Khan and Amir Ahmad, "Cluster Center Initialization Algorithm for K-means Clustering," Pattern Recognition Letters, Vol. 25, No. 11, pp. 1293-1302, 2004.
[20] Shin-Jye Lee and Xiao-Jun Zeng, "A Three-Part Input-Output Clustering-Based Approach to Fuzzy System Identification," 2010 10th International Conference on Intelligent Systems Design and Applications (ISDA), pp. 55-60, 2010.
[21] Ward Jr., J. H., "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, Vol. 58, pp. 236-244, 1963.
