Introduction To KEA-Means Algorithm For Web Document Clustering
Department of Computer Engineering and Department of Information Technology, SIT, Lonavala, Pune, India
Abstract-- In most traditional document clustering techniques, the total number of clusters is not known in advance, and the cluster that contains the target or precise information cannot be determined. The K-means algorithm addresses this by letting the user supply the number of clusters, k; however, when the value of k is modified, the precision of the result also changes. To solve this problem, this paper introduces a new clustering algorithm, KEA-Means, which combines Kea (a keyphrase extraction algorithm that returns several key phrases from the source documents using machine learning, building a model containing rules for generating clusters of the web documents in the dataset) with K-means clustering. The algorithm automatically generates the number of clusters at run time, so the user need not specify k. The KEA-Means clustering algorithm thus provides the value of k and is beneficial for extracting text documents from massive quantities of resources.

General Terms-- Data mining, KEA-Means algorithm, K-means algorithm, web document clustering

Keywords-- K-means clustering, Kea keyphrase extraction algorithm, KEA-Means algorithm, F-measure
I. INTRODUCTION

The increasing size and dynamic content of the World Wide Web have created a need for automated organization of web pages. Document clusters can provide a structure for organizing large bodies of text for efficient browsing and searching. For this purpose, a web page is typically represented as a vector consisting of the suitably normalized frequency counts of words or terms. Each document contains only a small percentage of all the words ever used on the web. If we consider each document as a multi-dimensional vector and then try to cluster documents based on their word contents, the problem differs from classic clustering scenarios in several ways. Document clustering data is high dimensional, characterized by a highly sparse word-document matrix with positive ordinal attribute values and a significant number of outliers [1].

The main drawback of traditional document clustering techniques is that the total number of clusters is not known a priori, and the cluster that contains the target information cannot be determined, since no semantic label is associated with the cluster. To solve this problem, this paper proposes a new clustering algorithm based on the Kea keyphrase algorithm, which extracts several key phrases from source documents using machine learning techniques. The Kea-means clustering algorithm provides an easy and efficient way to extract text documents from massive quantities of resources.

II. LITERATURE SURVEY

Shen Huang, Zheng Chen, Yong Yu, and Wei-Ying Ma (2006) [7] proposed Multitype Features Co-selection for Clustering (MFCC), a novel algorithm that exploits different types of features to perform web document clustering. They used the intermediate clustering result in one feature space as additional information to enhance feature selection in the other spaces. Consequently, the better feature set co-selected from heterogeneous features produces better clusters in each space, and the improved intermediate result further improves co-selection in the next iteration. Feature co-selection is thus implemented iteratively and can be well integrated into an iterative clustering algorithm.

Michael Steinbach, George Karypis, and Vipin Kumar (2000) [2] presented the results of an experimental study of some common document clustering techniques: agglomerative hierarchical clustering and K-means (using both a standard K-means algorithm and a bisecting K-means algorithm). Their results indicate that the bisecting K-means technique is better than the standard K-means approach and, somewhat surprisingly, as good as or better than the hierarchical approaches they tested.

Jiang-Chun Song and Jun-Yi Shen (2003) [9], working from the Vector Space Model (VSM) of web documents, improved the nearest-neighbor method, put forward a new web document clustering algorithm, and studied its validity and scalability as well as its time and space complexity, showing that their algorithm performs better than K-means.

Yan Gao, Shiwen Gu, Liming Xia, and Yaoping Fei (2006) proposed a new multi-view information bottleneck algorithm (MVIB) that extends the information bottleneck algorithm to the multi-view setting to
ISSN: 2231-280
http://www.internationaljournalssrg.org
Page 630
The distance between a document and a cluster center is measured with the Euclidean distance,

d(P, C) = sqrt( Σ_{i=1}^{n} (P_i − C_i)² )

Where P is the data point, C is the cluster center, and n is the number of features. The algorithm is composed of the following steps [3]:

1. Place k points into the space represented by the documents that are being clustered. These points represent the initial group centroids.
2. Assign each document to the group that has the closest centroid.
3. When all documents have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.

This produces a separation of the documents into groups from which the metric to be minimized can be calculated.

C. Kea: Automatic Keyphrase Extraction

Keyphrase extraction is a method of automatically extracting important phrases from text, mainly for document summarization [4, 5]. Kea is a Java-based keyphrase extraction algorithm that generates candidate phrases from a document and selects keyphrases from them using TF-IDF and a naive Bayes classifier [6]. Kea's extraction algorithm has two stages:

1. Training: The training stage uses a set of training documents for which the author's keyphrases are known. For each training document, candidate phrases are identified and their feature values are calculated. Each phrase is then marked as a keyphrase or a non-keyphrase, using the actual keyphrases for that document.
2. Extraction: In the extraction stage, the algorithm chooses keyphrases from a new document using the model. To select keyphrases from a new document, Kea determines candidate phrases and feature values, and then applies the model built during training. The model determines the overall probability that each candidate is a keyphrase, and a post-processing operation then selects the best set of keyphrases.

D. F-measure

The second external quality measure is the F-measure, a measure that combines the precision and recall ideas from information retrieval.
We treat each cluster as if it were the result of a query and each class as if it were the desired set of documents for that query. We then calculate the recall and precision of that cluster for each given class. More specifically, for cluster j and class i,

Recall(i, j) = n_ij / n_i        Precision(i, j) = n_ij / n_j

where n_ij is the number of members of class i in cluster j, n_i is the number of members of class i, and n_j is the number of members of cluster j. The F-measure of cluster j and class i is then

F(i, j) = (2 × Recall(i, j) × Precision(i, j)) / (Precision(i, j) + Recall(i, j))

and the overall F-measure is

F = Σ_i (n_i / n) max_j F(i, j)
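This computation can be sketched in Python; the function and variable names below are ours, chosen for illustration, not taken from the paper.

```python
from collections import Counter

def f_measure(labels_true, labels_pred):
    """Overall F-measure of a clustering: each cluster is treated as
    the result of a query and each class as the desired answer set."""
    n = len(labels_true)
    # n_ij: number of documents of class i that fall in cluster j
    pair_counts = Counter(zip(labels_true, labels_pred))
    class_sizes = Counter(labels_true)    # n_i
    cluster_sizes = Counter(labels_pred)  # n_j

    total = 0.0
    for i, n_i in class_sizes.items():
        best = 0.0
        for j, n_j in cluster_sizes.items():
            n_ij = pair_counts[(i, j)]
            if n_ij == 0:
                continue
            recall = n_ij / n_i
            precision = n_ij / n_j
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # weight each class by its share of the documents
        total += (n_i / n) * best
    return total
```

A perfect clustering (each cluster exactly one class) yields an F-measure of 1.0; fully mixed clusters score lower.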
Where the max is taken over all clusters at all levels, and n is the number of documents.

E. Dataset

This project uses the Reuters dataset. Reuters-21578 is a collection of 21,578 newswire articles. The articles are assigned classes from a set of 119 topic categories. This project extracted two smaller samples from Reuters-21578 for cluster evaluation [10].
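The K-means procedure of Section III (steps 1-4 above) can be sketched as follows; this is a minimal illustration over plain document vectors, not the authors' implementation.

```python
import math
import random

def euclidean(p, c):
    # d(P, C) = sqrt(sum over i of (P_i - C_i)^2)
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, c)))

def k_means(docs, k, max_iter=100):
    """Plain K-means over document vectors (lists of floats)."""
    # Step 1: place k points into the space; here, k random documents.
    centroids = random.sample(docs, k)
    for _ in range(max_iter):
        # Step 2: assign each document to the closest centroid.
        groups = [[] for _ in range(k)]
        for d in docs:
            j = min(range(k), key=lambda j: euclidean(d, centroids[j]))
            groups[j].append(d)
        # Step 3: recalculate the positions of the k centroids.
        new_centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        # Step 4: repeat until the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return groups, centroids
```

On two well-separated pairs of points the sketch recovers the natural two-cluster split regardless of which documents seed the centroids.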
IV. THE PROPOSED APPROACH

The following figure shows the proposed model for clustering web documents: analyze the web documents, extract heuristics and key phrases, group the documents into clusters, then select the cluster containing the key phrase, extract its documents, and compute the F-measure.
The proposed method involves the following steps:

1. Extract heuristics and key phrases from the documents.
2. Determine the number of clusters.
3. Group the documents into clusters (K-means).
4. Find similar key phrases in each cluster.
5. Compare the similar key phrases with a threshold.
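Step 1 relies on keyphrase extraction. As a rough sketch of the TF-IDF part of Kea's scoring (real Kea also uses the first-occurrence feature and a naive Bayes model, and filters candidates by stopwords; none of that is reproduced here), one might score word n-grams like this:

```python
import math
import re
from collections import Counter

def candidate_phrases(text, max_len=3):
    """Simplified candidate generation: all word n-grams of length
    1..max_len. (Kea additionally applies stopword-based filtering.)"""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def tfidf_keyphrases(doc, corpus, top=5):
    """Score the candidates of `doc` by TF x IDF over `corpus` and
    return the highest-scoring phrases."""
    cands = candidate_phrases(doc)
    tf = Counter(cands)
    doc_cands = [set(candidate_phrases(d)) for d in corpus]
    n_docs = len(corpus)
    scores = {}
    for phrase, count in tf.items():
        df = sum(1 for s in doc_cands if phrase in s)  # document frequency
        idf = math.log(n_docs / (1 + df))
        scores[phrase] = (count / len(cands)) * idf
    return [p for p, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top]]
```

Phrases frequent in one document but rare across the corpus score highest, which is the intuition behind Kea's TF-IDF feature.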
(Figure: flowcharts of K-means clustering and Kea-means clustering.)
While the K-means algorithm uses either the cosine measure or the Euclidean distance measure to compare feature vectors, the Kea-means clustering algorithm uses these two measures simultaneously. Hence, the similarity of two documents is computed by an expression that combines the two measures.
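The paper's exact combining expression is not reproduced here. Purely as an illustration of combining the two measures, one plausible form is a weighted sum of cosine similarity and a Euclidean distance mapped into a similarity; the weight `alpha` below is our assumption, not taken from the paper.

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_sim(a, b):
    # Map the Euclidean distance into (0, 1]: 1 when the vectors coincide.
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def combined_sim(a, b, alpha=0.5):
    """Hypothetical blend of the two measures used by Kea-means;
    the weighting scheme is an assumption for illustration only."""
    return alpha * cosine_sim(a, b) + (1 - alpha) * euclidean_sim(a, b)
```

Identical vectors score 1.0 under both components, so any choice of `alpha` preserves a maximum similarity of 1.0 for exact matches.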
REFERENCES
[1] Alexander S., Joydeep G., and Raymond M. 2000. Impact of similarity measures on web-page clustering. University of Texas at Austin, TX 78712-1084, USA.
[2] M. Steinbach, G. Karypis, and V. Kumar. 2000. A comparison of document clustering techniques. Proc. KDD Workshop on Text Mining, 1-20.
[3] Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/
[4] P. Turney. 1999. Learning to extract keyphrases from text. Technical Report ERB-1057, Institute for Information Technology, National Research Council of Canada.
[5] P. Turney. 2003. Coherent keyphrase extraction via web mining. Proc. 18th International Joint Conference on Artificial Intelligence (IJCAI), 434-439.
[6] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. 1999. KEA: Practical Automatic Keyphrase Extraction. Dept. of Computer Science, University of Waikato.
[7] Shen Huang, Zheng Chen, Yong Yu, and Wei-Ying Ma. 2006. Multitype Features Coselection for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 4, April 2006.
[8] Shobha Sanjay Raskar and D. M. Thakore. 2010. Kea-mean clustering approach for text mining. International Journal of Power Control Signal and Computation (IJPCSC), Vol. 2.
[9] Jiang-Chun Song and Jun-Yi Shen. 2003. A web document clustering algorithm based on concept of neighbor. Proceedings of the Second International Conference on Machine Learning and Cybernetics, 2-5 November 2003.
[10] http://www.daviddlewis.com/resources/testcollections/reuters21578/