A New Hybrid K-Means and K-Nearest-Neighbor Algorithms For Text Document Clustering
Kamal Khalilpour
Islamic Azad University, Mahabad Branch
All content following this page was uploaded by Kamal Khalilpour on 27 November 2016.
DOI: 10.7813/2075-4124.2014/6-3/A.12
ABSTRACT
Large numbers of text documents are now available through the internet and other sources. Without indexing
and proper clustering, retrieving and processing these unstructured documents runs into many problems. Text
document clustering is considered one of the most effective ways of organizing text documents: the clustering
process places a wide range of scattered data into organized groups of similar items. The aim of clustering is
fast and reliable access to related documents and identification of the logical connections between them. Good
clustering requires algorithms that identify similar texts accurately and efficiently, so the precision of an
algorithm in processing documents is one of the most important factors in clustering. Clustering also makes it
possible to explore the data space and to recognize structure in the data, which makes it well suited to working
with very large data collections. In this paper, we combine the K-means algorithm and the K-Nearest-Neighbor
(KNN) algorithm, two of the best-known data mining algorithms, for clustering text documents. Test results show
that the hybrid K-means and KNN algorithm achieves good accuracy in clustering text documents.
1. INTRODUCTION
Text documents are among the most important sources of information in many detection and control
systems. With the development of information technology and the growth of available information, the volume
of text documents is increasing rapidly. Clustering of text documents is one of the methods that make it
possible to organize documents, process them intelligently and access them more easily [1]. Clustering is the
process of grouping objects into clusters such that the data in each cluster are most similar to each other and
least similar to members of other clusters [2]. Given the increasing amount of text data, automatic clustering of
texts is a good solution for unstructured information. Clustering is a descriptive data mining technique that
extracts similar patterns from the data without a predetermined goal [3]. Data mining is a very efficient method
for discovering information in structured and unstructured data [4]. Data mining can be used to discover and
cluster the relationships between data.
Clustering finds the underlying structure in a set of unclassified data. In other words, clustering puts the
data into groups whose members are similar to each other from a certain point of view. As a result, the
similarity between data within each cluster is maximal and the similarity between data in different clusters is
minimal. In clustering, similarity is measured by distance, and instances in the same cluster are closer to each
other. For example, in document clustering, the closeness of two documents depends on the number of words
they have in common, and in clustering customers' shopping carts, the distance is determined by the similarity
of the purchases [5,6].
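As an illustration of distance as a similarity measure, the following minimal Python sketch compares documents by the Euclidean distance between their word-count vectors. The whitespace tokenizer, the function names and the three example documents are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())


def euclidean_distance(doc_a, doc_b):
    """Euclidean distance between the word-count vectors of two documents."""
    counts_a, counts_b = term_counts(doc_a), term_counts(doc_b)
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for words that do not occur in a document.
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))


# Hypothetical example documents.
d1 = "oil prices rise"
d2 = "oil prices fall"
d3 = "wheat harvest grows"

near = euclidean_distance(d1, d2)  # two shared words, so a small distance
far = euclidean_distance(d1, d3)   # no shared words, so a larger distance
```

Documents that share more words end up closer together, which is the behavior the paragraph above describes.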
Among data mining techniques, the K-means algorithm [7] is widely used to assign data to clusters. The
algorithm assigns each datum to the cluster to which it is most similar. The KNN algorithm [8] chooses as
neighbors the set of documents that are most similar to each other, based on a defined similarity measure.
In this paper, we propose a new method based on the KNN and K-means algorithms for text document
clustering. The proposed model includes the steps of processing and clustering the documents. In the
processing phase, the best features of each text document are extracted and the K-means algorithm is used for
clustering; in the next step, the KNN algorithm assigns documents to one of the clusters based on the nearest
distance. In the proposed model, the nearest-distance criterion is used in the KNN algorithm to increase the
accuracy of clustering, so that the features related to each cluster are chosen using this criterion.
Baku, Azerbaijan| 79
INTERNATIONAL JOURNAL of ACADEMIC RESEARCH Vol. 6. No. 3. May, 2014
The proposed model is evaluated on the Reuters-21578 dataset, which includes various types of text
documents [9].
The structure of the paper is as follows: Section 2 reviews related work in the field of text document
clustering; Section 3 describes the proposed model; Section 4 presents the evaluation and results of the
proposed model; and Section 5 presents the conclusion and future work.
2. RELATED WORKS
In recent years, text document clustering has received wide attention and has become one of the most
active research areas in artificial intelligence and information technology. Many studies have been done, but
the range of activities is so wide that more attention is needed.
Clustering is one of the most important data mining techniques for text document detection. In text
document clustering, the aim is to optimize the clustering for a minimum or maximum number of clusters. The
K-means algorithm has been used for text document clustering [10]. The purpose of applying K-means is to
select the best cluster center, the one closest to the documents. Test results show that the K-means algorithm
performs well in clustering text documents with minimal runtime. A new model for text document clustering has
been proposed using a combination of the Naïve Bayes algorithm and the Self-Organizing Map (SOM) [11]. This
method reduces the dependence on the initial number of clusters and the initial location of the cluster centers.
It also addresses the inability to cluster documents whose distances from the cluster centers are the same.
Another benefit of this combination is reduced computational complexity. A further model for clustering text
documents combines the Particle Swarm Optimization (PSO) and K-means algorithms [12]. The PSO algorithm
selects the related and best-fitting words from the text space, and clustering is performed on them.
Experimental results show that the hybrid algorithm performs better than the K-means algorithm.
To increase the accuracy of K-means, a Genetic Algorithm (GA) has been used to select the words relevant
to clustering similar documents [13]. Through iteration and the creation of successive generations, GA can
improve clustering accuracy and reduce computational cost. Experimental results show that GA greatly
increases the accuracy of the K-means algorithm. The authors of [14] studied text document clustering using
the SOM and K-means algorithms. Their results show that K-means is very sensitive to its initial values, so
SOM performs better in document clustering. Another method of text document clustering uses a combination
of fuzzy logic and the Ant Colony Optimization (ACO) algorithm [15]. The results of the hybrid model show
improved clustering performance. An advantage of the ACO algorithm is that it predicts the number of clusters
for the documents, and the results demonstrate the performance of the hybrid algorithm in comparison with
K-means. Another model for text document clustering is based on the combination of GA and Differential
Evolution (DE) [16]. Crossover and mutation operators are used to optimize the number of clusters. Test results
compare GA with DE and show that the hybrid algorithm is more efficient than either GA or DE alone.
Text document clustering has also been performed using the Clustering based on Frequent Word
Sequences and Clustering based on Frequent Word Meaning Sequences techniques [17]. These models take
into account the meaning and the proximity of words. Clustering is one of the most important data mining
techniques for discovering rules among objects. In [18], a hybrid of the Medoid and KNN algorithms is used for
text document clustering; Medoid is itself a clustering algorithm. Test results show that the hybrid algorithm
performs better than Medoid alone.
3. PROPOSED MODEL
One of the problems of text document clustering is selecting the similarity features between documents
[19, 20]. Unrelated and disparate features often reduce the efficiency of clustering. The distance criterion is
therefore very important for measuring clustering efficiency. In this paper, we provide a two-stage approach to
document clustering. In the first stage, documents are processed using the K-means algorithm, so that
documents that are most closely associated are placed in the same cluster and those that are less related are
placed in different clusters. In the second stage, we use the KNN algorithm to improve the clusters based on
the similarity of the text documents.
One of the popular clustering algorithms is K-means [7]. Despite its simplicity, K-means is the basis of
many other clustering methods. It is one of the most common clustering methods, but despite its many
advantages, such as high speed and ease of implementation, it can become trapped in a local optimum and is
not guaranteed to produce the optimal solution. The objective function of the K-means algorithm is defined in
equation (1).
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$
In equation (1), k is the number of cluster centers, n is the number of data samples, x_i^(j) is the i-th data
sample assigned to cluster j, and c_j is the center of cluster j. Each data sample is assigned to the cluster
center to which it has the least distance.
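The objective function in equation (1) and the nearest-center assignment rule can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs: the function names, the four two-dimensional points and the two centers are hypothetical and do not come from the paper.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a data sample x and a center c."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def assign_to_nearest(points, centers):
    """Index of the nearest cluster center for each data sample."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def objective(points, centers, labels):
    """Value of J: sum of squared distances of samples to their assigned centers."""
    return sum(squared_distance(p, centers[j]) for p, j in zip(points, labels))


# Hypothetical data: two well-separated groups and their centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

labels = assign_to_nearest(points, centers)
J = objective(points, centers, labels)
```

Each of the four points lies at distance 1 from its center, so J is the sum of four squared distances of 1.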
The KNN algorithm requires a distance function to find the nearest neighbors [8]. Nearness is defined by the
Euclidean distance: for two points O1=(x11,x12,x13,…,x1n) and O2=(x21,x22,x23,…,x2n), the Euclidean distance
between them is defined by equation (2).
$$d(O_1, O_2) = \sqrt{\sum_{j=1}^{n} \left( x_{1,j} - x_{2,j} \right)^2} \qquad (2)$$
Once the Euclidean distances between the points have been calculated with equation (2), the elements are
sorted by distance and the unknown sample is assigned the label held by the majority of its k nearest
neighbors. In this paper, we therefore use the combination of the K-means and KNN algorithms for clustering
text documents in order to improve the efficiency of similarity-based clustering. Fig 1 shows a flowchart of the
proposed model.
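The KNN rule described above (compute Euclidean distances, sort, take the majority among the k nearest neighbors) can be sketched as follows. The function names, the labels "A" and "B" and the sample points are hypothetical and serve only to illustrate the rule.

```python
import math
from collections import Counter


def euclidean(o1, o2):
    """Euclidean distance between two points of equal dimension, as in equation (2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(o1, o2)))


def knn_label(sample, labeled_points, k):
    """Label held by the majority of the k nearest labeled points.

    labeled_points is a list of (vector, label) pairs.
    """
    nearest = sorted(labeled_points, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled points forming two groups.
labeled = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
    ((5.0, 5.0), "B"), ((6.0, 5.0), "B"),
]
predicted = knn_label((0.5, 0.5), labeled, k=3)
```

With k=3 the three nearest neighbors of the sample all carry the label "A", so the sample receives that label.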
The algorithm ends when all clusters have been optimized and no further improvement in the solution is
achieved. Because the review of each cluster and the selection of cluster centers are based on K-means, a
change in the objective function changes the other cluster centers accordingly. Reviewing the clusters leads to
better solutions that are closer to the optimal solution. Fig 2 shows the pseudocode of the proposed model.
1. Start
2. Initial parameters
   D: document collection
3. Repeat
   3.1. Choose centroids from D
   3.2. Assign data to the closest cluster based on centroid
        Calculate the centroid for each cluster
        Calculate the Euclidean distances
        Evaluate distances between data of clusters
        Update clusters
   3.3. Optimize distances using KNN
   3.4. Calculate similarity
4. Until (the stopping condition is satisfied)
5. Put all documents in clusters
6. End
Fig. 2. Pseudocode of the proposed model
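One possible reading of the steps above is sketched below in Python. It is not the authors' implementation: the paper does not specify how the KNN step refines the K-means result, so this sketch assumes that each document is reassigned to the cluster held by the majority of its nearest neighbors. Seeding the centroids with the first k documents, the function names and the example vectors are further assumptions made to keep the sketch deterministic.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def kmeans(points, k, max_iter=100):
    # Assumption: the first k documents seed the centroids.
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points]
        if new_labels == labels:  # stop when assignments no longer change
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = mean(members)
    return labels, centers


def knn_refine(points, labels, n_neighbors):
    """Assumed KNN step: majority vote over each document's nearest neighbors."""
    refined = []
    for i, p in enumerate(points):
        others = sorted(
            ((euclidean(p, q), labels[m]) for m, q in enumerate(points) if m != i),
            key=lambda pair: pair[0],
        )
        votes = Counter(lab for _, lab in others[:n_neighbors])
        refined.append(votes.most_common(1)[0][0] if votes else labels[i])
    return refined


def hybrid_cluster(points, k, n_neighbors=3):
    labels, _ = kmeans(points, k)
    return knn_refine(points, labels, n_neighbors)


# Hypothetical feature vectors for six documents in two groups.
documents = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = hybrid_cluster(documents, k=2, n_neighbors=2)
```

In this sketch the KNN vote uses the K-means labels as they stood before refinement, so the result does not depend on the order in which documents are visited.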
distance between the clusters is very important because the quality of the final results depends on it. To
benefit from text document clustering, we use the combination of the KNN and K-means algorithms. The
benefits of the KNN algorithm are its simplicity, efficiency, high speed and small number of parameters. One of
the major problems in clustering is choosing an appropriate number of clusters. For suitable clustering, two
criteria should be used together. If only the density criterion is used, each datum can be regarded as its own
cluster, because no cluster is as dense as a cluster with a single datum. If only the dispersion criterion is used,
the best clustering treats all the data as a single cluster, since the distance of a cluster from itself is zero. A
combination of the two criteria should therefore be used. To determine the correct number of clusters, different
evaluation functions have been defined. In the model, Purity and Entropy are used as the evaluation criteria,
as defined in equations (3) and (4) [21, 22].
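$$\mathrm{Purity} = \sum_{j=1}^{m} \frac{n_j}{n} \max_{i} P_{ij} \qquad (3)$$
$$\mathrm{Entropy} = -\frac{1}{\log c} \sum_{j=1}^{m} \frac{n_j}{n} \sum_{i=1}^{c} P_{ij} \log P_{ij} \qquad (4)$$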
Entropy measures the disorder (lack of purity) of a collection of data. In equations (3) and (4), P_ij is the
probability that a document in cluster j belongs to class i, m is the number of clusters, n_j is the number of
documents in cluster j, n is the total number of documents and c is the number of classes. Clustering of text
documents was implemented using a combination of the K-means and KNN algorithms in Visual C# .NET
2008. Fig 3 shows the scheme of the program.
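The Purity and Entropy criteria of equations (3) and (4) can be computed as in the following Python sketch. It assumes that each cluster is given as a list of the true class labels of its documents and that the entropy is normalized by the logarithm of the number of classes; the function names and the example clustering are hypothetical.

```python
import math
from collections import Counter


def class_probabilities(members):
    """P_ij for one cluster: fraction of its documents belonging to each class."""
    n_j = len(members)
    return [count / n_j for count in Counter(members).values()]


def purity(clusters):
    """Weighted sum over clusters of the largest class probability."""
    n = sum(len(members) for members in clusters)
    return sum(
        (len(members) / n) * max(class_probabilities(members))
        for members in clusters if members
    )


def entropy(clusters, num_classes):
    """Weighted cluster entropy, normalized by log(num_classes); needs num_classes > 1."""
    n = sum(len(members) for members in clusters)
    total = sum(
        (len(members) / n) * sum(p * math.log(p) for p in class_probabilities(members))
        for members in clusters if members
    )
    return -total / math.log(num_classes)


# Hypothetical clustering of eight documents from two classes, "a" and "b".
clusters = [["a", "a", "a", "b"], ["b", "b", "b", "b"]]
purity_value = purity(clusters)
entropy_value = entropy(clusters, num_classes=2)
```

Higher Purity and lower Entropy both indicate clusters that are dominated by a single class.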
4. EVALUATION AND RESULTS
In this section, the results and evaluation of the proposed model on the Reuters-21578 dataset are reviewed.
The Reuters-21578 dataset contains 22 files. The first 21 files (reut2-000.sgm to reut2-020.sgm) contain 1000
documents each and the last file (reut2-021.sgm) contains 578 documents. The simulation was performed on a
diverse set of documents. To demonstrate performance, the algorithms are evaluated with the Purity and
Entropy criteria. Clusters are formed from data that are close to each other and far from the other clusters. The
purpose of assessing the number of clusters is to minimize the sum of squared distances of the documents to
their nearest cluster center.
Table (1) reports the Purity criterion of the proposed model and K-means for different values of k.
Depending on the document space, different values of k can be used, especially in the hybrid model. As k
increases, more documents take part in the cluster decision and the accuracy of the results decreases,
because documents with low similarity scores become involved in clustering, which reduces the efficiency of
document clustering. Nevertheless, the proposed model maintains better Purity than K-means as k increases.
Table 1. Purity (%) of K-means and the proposed model for different k values
Number of Documents   k-value   K-Means   Proposed Model
100                   3         76.85     87.32
100                   4         68.23     72.37
100                   5         55.92     67.15
250                   3         72.15     78.52
250                   4         65.14     69.11
250                   5         58.16     64.92
500                   3         65.83     71.23
500                   4         54.20     68.88
500                   5         48.92     62.25
Fig 4 shows the Purity comparison chart for the proposed model and K-means on sets of 100, 250 and 500
documents. As can be seen, the proposed model achieves higher Purity.
Fig. 4. The Chart of Purity Comparison for the Proposed Model and the K-means
Table (2) reports the effect of k on the Entropy of the proposed model and K-means. As can be seen, the
Entropy of the proposed model is lower. The lower the Entropy, the less disorder there is in the data and the
more accurate the clusters.
Table 2. Entropy (%) of K-means and the proposed model for different k values
Number of Documents   k-value   K-Means   Proposed Model
100                   3         76.40     61.12
100                   4         69.65     56.85
100                   5         60.74     49.17
250                   3         68.15     58.78
250                   4         65.23     50.36
250                   5         58.62     47.58
500                   3         62.09     55.13
500                   4         58.76     44.37
500                   5         51.27     41.11
Table (2) shows that the proposed model produces better results than the K-means algorithm. Indeed,
comparing Purity, Entropy and the value of k shows that the proposed model increases the efficiency and
effectiveness of clustering. Fig 5 shows the Entropy comparison chart for the proposed model and K-means on
sets of 100, 250 and 500 documents. As can be seen, the proposed model achieves lower Entropy.
Fig. 5. The Chart of Entropy Comparison for the Proposed Model and the K-means
The evaluation criteria show that the sample size, the number of clusters and the number of parameters
can strongly affect the results, and that the proposed model performs better than K-means. Based on the
modeling, it can therefore be said that in most cases the proposed model has better accuracy.
5. CONCLUSION AND FUTURE WORKS
Text document clustering is a process for managing documents, developing document collections and
reducing the dispersion of information. In text document clustering we often face a very high-dimensional data
space. In clustering, the clusters are not known in advance, and neither are the properties on which clustering
should be based. As a result, after clustering with an efficient algorithm, the resulting clusters must be
interpreted, and in some cases, after the clusters have been reviewed, parameters that were used in clustering
but are irrelevant or of little significance must be removed and clustering repeated from the beginning. In this
paper, we used the combination of the K-means and KNN algorithms, which are among the most important data
mining algorithms, for clustering text documents. The results show that the proposed model has better
efficiency and accuracy than K-means under the Purity and Entropy criteria. In future work, we hope to use
other combinations of data mining algorithms for text document clustering.
REFERENCES
1. A. Kalogeratos, A. Likas. Document Clustering Using Synthetic Cluster Prototypes. Data & Knowledge
Engineering, 70;3:284-306, 2011.
2. V. Radhakrishna, C. Srinivas, C.V. Guru Rao. Document Clustering Using Hybrid XOR Similarity
Function for Efficient Software Component Reuse. Procedia Computer Science, 17;121-128, 2013.
3. Chien-Liang Liu, Wen-Hoar Hsaio, Chia-Hoang Lee, Chun-Hsien Chen. Clustering Tagged
Documents with Labeled and Unlabeled Documents. Information Processing & Management,
49;3:596-606, 2013.
4. T. Velmurugan. Performance Based Analysis between K-Means and Fuzzy C-Means Clustering
Algorithms for Connection Oriented Telecommunication Data. Applied Soft Computing, 19;134-46,
2014.
5. C. Tang, S. Wang, W. Xu. New Fuzzy C-means Clustering Model Based on the Data Weighted
Approach. Data & Knowledge Engineering, 69;9:881-900, 2010.
6. M.A. Qady, A. Kandil. Automatic Clustering of Construction Project Documents Based on Textual
Similarity. Automation in Construction, 42;36-49, 2014.
7. J.B. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1;281-297, 1967.
8. R.O. Duda, P.E. Hart. Pattern Classification and Scene Analysis. New York: Wiley, 1973.
9. D.D. Lewis. Reuters-21578 Text Categorization Test Collection. Distribution 1.0, 1997.
10. C. Luo, Y. Li, S.M. Chung. Text Document Clustering Based on Neighbors. Data & Knowledge
Engineering, 68;11:1271-1288, Elsevier B.V, November 2009.
11. D. Isa, V.P. Kallimani, L.H. Lee. Using the Self-Organizing Map for Clustering of Text Documents,
Expert Systems with Applications. 36;5:9584-9591, Elsevier Ltd, 2009.
12. X. Cui, T.E. Potok. Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm. Journal
of Computer Sciences, pp. 27-33, 2005.
13. H. Verma, E. Kandpal, B. Pandey, J. Dhar. A Novel Document Clustering Algorithm Using Squared
Distance Optimization Through Genetic Algorithms. International Journal on Computer Science and
Engineering, 2;5:1875-79, 2010.
14. Y. Chen, B. Qin, T. Liu, Y. Liu, Sheng Li, The Comparison of SOM and K-means for Text Clustering.
3;2:268-274, 2010.
15. J. Rajaie, B. Fakhar. A Novel Method for Document Clustering using Ant-Fuzzy Algorithm. The
Journal of Mathematics and Computer Science, 4;2:189-196, 2012.
16. Y.K. Meena, Shashank, V.P. Singh. Text Documents Clustering using Genetic Algorithm and
Discrete Differential Evolution. International Journal of Computer Applications, 43;1:16-19, April 2012.
17. Y. Li, S.M. Chung, J.D. Holt. Text Document Clustering Based on Frequent Word Meaning
Sequences. Data & Knowledge Engineering, 64;381-404, Elsevier B.V, 2008.
18. A. Kalogeratos, A. Likas. Document Clustering Using Synthetic Cluster Prototypes. Data &
Knowledge Engineering, 70;284–306, Elsevier B.V, 2011.
19. F.S. Gharehchopogh, Z.A. Khalifelu. Study on Information Extraction Methods in Unstructured Data:
Text Mining versus Natural Language Processing. AWERProcedia Information Technology &
Computer Science Journal, 1;1321-27, 2012.
20. F.S. Gharehchopogh, Z.A. Khalifelu. Analysis and Evaluation of Unstructured Data: Text Mining
versus Natural Language Processing. 5th International Conference on Application of Information and
Communication Technologies (AICT2011), IEEE press, pp. 1-4, Baku, Azerbaijan, 12-14 October
2011.
21. C.J. Van Rijsbergen. Information Retrieval. Second ed., Butterworths, London, 1979.
22. M. Steinbach, G. Karypis, V. Kumar. A Comparison of Document Clustering Techniques. In: KDD
Workshop on Text Mining. 2000.