A Novel Multi-Viewpoint Based Similarity Measure For Document Clustering
M. Tech (CSE), Sri Sai Madhavi Institute of Science & Technology, A.P., India. Asst. Professor, Dept. of CSE, Sri Sai Madhavi Institute of Science & Technology, A.P., India.
Abstract: Data mining is the process of analyzing data to uncover patterns or trends. It covers many techniques, including related areas such as text mining and web mining. Clustering is one of the most important data mining and text mining techniques: it groups similar objects together, organizing the given objects into meaningful subgroups that make further analysis easier. Clustered groups simplify search and reduce the number of operations and the computational cost. Clustering methods are commonly classified into data partitioning, hierarchical clustering, and data grouping. The aim of this paper is to develop a new method for clustering text documents, which are sparse and high-dimensional data objects. Like the k-means algorithm, the proposed algorithm is fast and provides consistent, high-quality performance in clustering text documents. The proposed similarity measure is based on multiple viewpoints.
Keywords: Clustering, Euclidean distance, HTML, Similarity measure.
I. INTRODUCTION
Clustering is the process of grouping a set of objects into classes of similar objects, and it is one of the most interesting topics in data mining. A cluster is a collection of data objects that are similar to one another. The purpose of clustering is to find the fundamental structures in data and organize them into meaningful subgroups for further analysis. Many clustering algorithms are published every year, proposing a variety of techniques and approaches. The k-means algorithm remains one of the top data mining algorithms in current use. It has a few basic drawbacks, for example when clusters are of various sizes. Despite these drawbacks, its understandability, simplicity, and scalability are the main reasons for its popularity. A common approach to the clustering problem is to treat it as an optimization process: an optimal partition is found by optimizing a particular function of similarity among the data. This carries an implicit assumption that the true intrinsic structure of the data can be correctly described by the similarity formula defined and embedded in the clustering criterion function. An algorithm with adequate performance and usability in most application scenarios can be preferable to one with better performance in some cases but limited usage due to high complexity. While offering reasonable results, k-means is fast and easy to combine with other methods in larger systems. The original k-means has a sum-of-squared-error objective function that uses the Euclidean distance. In a very sparse and high-dimensional domain like text documents, spherical k-means, which uses cosine similarity (CS) instead of the Euclidean distance as the measure, is deemed to be more suitable [1], [2].
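To illustrate why cosine similarity is often preferred over the Euclidean distance for text, the following minimal Python sketch compares the two on a pair of made-up term-count vectors. The vectors `doc_short` and `doc_long` and the helper names are assumptions introduced only for this illustration; they do not come from the paper.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Euclidean distance ||a - b||; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical term-count vectors: doc_long has the same term proportions
# as doc_short, as if the same text were repeated three times.
doc_short = [1.0, 2.0, 0.0, 1.0]
doc_long = [3.0, 6.0, 0.0, 3.0]

dist = euclidean_distance(doc_short, doc_long)  # large: lengths differ
cos = cosine_similarity(doc_short, doc_long)    # close to 1: same direction

print(f"Euclidean distance: {dist:.3f}")
print(f"Cosine similarity:  {cos:.3f}")
```

The two documents have identical term proportions, so cosine similarity treats them as equivalent, while the Euclidean distance separates them purely because of their lengths.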
The nature of the similarity measure plays a very important role in the success or failure of a clustering method. Our objective is to derive a novel multi-viewpoint similarity measure between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed measure, we then formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable like k-means.
II. RELATED WORK
Document clustering is one of the important text mining techniques and has been around since the inception of the text mining domain. It is the process of grouping objects into categories such that intra-cluster similarity and inter-cluster dissimilarity are both maximized. Here an object is a document, and a term is a word in a document. Each document considered for clustering is represented as an m-dimensional vector d, where m is the total number of terms in the document collection. Document vectors are the result of a weighting scheme such as TF-IDF (Term Frequency-Inverse Document Frequency). Many approaches to document clustering exist, including information-theoretic co-clustering [3], non-negative matrix factorization, and probabilistic model-based methods [4]. However, these approaches do not use a specific measure of document similarity. In this paper we consider methods that do use an explicit measure. From the literature, one of the most popular measures is the Euclidean distance:
$$ \mathrm{Dist}(d_i, d_j) = \| d_i - d_j \| \tag{1} $$
K-means is one of the most widely used clustering algorithms and is counted among the top 10 data mining algorithms. Due to its simplicity and ease of use, it is still widely used in the data mining domain. The k-means algorithm uses the Euclidean distance measure: its main purpose is to minimize the Euclidean distance between the objects in each cluster and the centroid of that cluster. The objective, expressed in terms of the cluster centroids, is as follows:
International Journal of Modern Engineering Research (IJMER) www.ijmer.com Vol.3, Issue.4, Jul - Aug. 2013 pp-1906-1909 ISSN: 2249-6645
$$ \min \sum_{r=1}^{k} \sum_{d_i \in S_r} \| d_i - C_r \|^2 \tag{2} $$
where $C_r$ is the centroid of cluster $S_r$ and $k$ is the number of clusters. In the text mining domain, cosine similarity is also a widely used measure of document similarity, especially for high-dimensional and sparse document clustering. Cosine similarity is used in a variant of k-means known as spherical k-means, which maximizes the cosine similarity between each cluster's centroid and the documents in that cluster. The difference between k-means with the Euclidean distance and k-means with cosine similarity is that the former focuses on vector magnitudes while the latter focuses on vector directions. Another popular approach is graph partitioning, in which the document corpus is treated as a graph. The min-max cut algorithm uses this approach and focuses on minimizing the criterion function:
---- (3)
The CLUTO [1] software package implements a document clustering method based on graph partitioning. It first builds a nearest-neighbor graph and then forms the clusters. In this approach, for given non-unit document vectors, the extended Jaccard coefficient is:
$$ \mathrm{Sim}_{eJacc}(u_i, u_j) = \frac{u_i^t u_j}{\| u_i \|^2 + \| u_j \|^2 - u_i^t u_j} \tag{4} $$
where $u_i$ and $u_j$ are the non-unit document vectors. Compared with cosine similarity and the Euclidean distance, the extended Jaccard coefficient takes both direction and magnitude into account. When the documents are represented as unit vectors, it behaves very much like cosine similarity. The cosine, Euclidean, Jaccard, and Pearson correlation measures have been compared, and the conclusion is that the Euclidean and Jaccard measures are best for web document clustering. In [1], the authors study categorical data: they select related attributes for a given subject and calculate the distance between two values. Document similarities can also be found using concept-based and phrase-based approaches. In [1], a tree-similarity measure is used at the concept level, while other work proposes a phrase-based approach. Both use Hierarchical Agglomerative Clustering to perform the clustering. Measures of structural similarity also exist for XML documents [5]; however, they differ from those used for clustering normal text documents.
III. PROPOSED WORK
The main work is to develop a novel multi-viewpoint based algorithm for document clustering that provides maximum efficiency and performance. It focuses in particular on studying the cluster overlapping phenomenon and using it to design cluster merging criteria. The main concern is a new way of computing the overlap rate that improves time efficiency and accuracy. Building on the hierarchical clustering method, the Expectation-Maximization (EM) algorithm is used in a Gaussian Mixture Model to estimate the parameters, and the two sub-clusters with the largest overlap are merged. In the simplest case, an optimization problem consists of maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function.
The generalization of optimization theory and techniques to other formulations comprises a large area of applied mathematics. The cosine similarity can be expressed as follows:
$$ \mathrm{Sim}(d_i, d_j) = \cos(d_i - 0,\; d_j - 0) = (d_i - 0)^t (d_j - 0) \tag{5} $$
where 0 is the zero vector, which represents the origin. According to this formula, the measure takes 0 as its one and only reference point. The multi-viewpoint similarity between two documents $d_i$ and $d_j$ in the same cluster $S_r$ is instead defined as follows:
$$ \mathrm{Sim}(d_i, d_j)_{d_i, d_j \in S_r} = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \mathrm{Sim}(d_i - d_h,\; d_j - d_h) \tag{6} $$
Here $n$ is the total number of documents and $n_r$ is the number of documents in cluster $S_r$. The multi-viewpoint similarity in Eq. (6) depends on the particular formulation of the individual similarities within the sum. If the relative similarity is defined by the dot product of the difference vectors, we have:
$$ \mathrm{MVS}(d_i, d_j \mid d_i, d_j \in S_r) = \frac{1}{n - n_r} \sum_{d_h \in S \setminus S_r} \cos(d_i - d_h,\; d_j - d_h)\, \| d_i - d_h \|\, \| d_j - d_h \| \tag{7} $$
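Equation (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mvs_dot` and `mvs_cos` and the toy vectors are hypothetical. The sketch evaluates the measure twice, once as an average of dot products of difference vectors and once in the cosine-times-distances form of (7), to show that the two agree.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def mvs_dot(d_i, d_j, viewpoints):
    """Average of (d_i - d_h)^t (d_j - d_h) over the viewpoints d_h."""
    if not viewpoints:
        raise ValueError("at least one viewpoint outside the cluster is required")
    total = sum(dot(sub(d_i, d_h), sub(d_j, d_h)) for d_h in viewpoints)
    return total / len(viewpoints)

def mvs_cos(d_i, d_j, viewpoints):
    """Same quantity written as cos(angle) * ||d_i - d_h|| * ||d_j - d_h||."""
    if not viewpoints:
        raise ValueError("at least one viewpoint outside the cluster is required")
    total = 0.0
    for d_h in viewpoints:
        u, v = sub(d_i, d_h), sub(d_j, d_h)
        norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
        if norm_u == 0.0 or norm_v == 0.0:
            continue  # a zero difference vector contributes nothing to the sum
        cosine = dot(u, v) / (norm_u * norm_v)
        total += cosine * norm_u * norm_v
    return total / len(viewpoints)

# Hypothetical toy data: two documents in one cluster and two viewpoints
# taken from outside that cluster (all unit vectors in two dimensions).
d_i = [1.0, 0.0]
d_j = [0.0, 1.0]
outside = [[-1.0, 0.0], [0.0, -1.0]]

print("cosine similarity from the origin:", dot(d_i, d_j))
print("MVS (dot-product form):", mvs_dot(d_i, d_j, outside))
print("MVS (cosine form):     ", mvs_cos(d_i, d_j, outside))
```

Seen from the origin, the two toy documents are orthogonal, so their cosine similarity is zero. Seen from the viewpoints outside the cluster, they lie in the same general direction, and the multi-viewpoint measure is positive.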
The similarity between two points $d_i$ and $d_j$ inside cluster $S_r$, viewed from a point $d_h$ outside this cluster, is equal to the product of the cosine of the angle between $d_i$ and $d_j$ as seen from $d_h$ and the Euclidean distances from $d_h$ to these two points. We now carry out a validity test for cosine similarity (CS) and the multi-viewpoint similarity (MVS). For each type of similarity measure, a similarity matrix $A = \{a_{ij}\}_{n \times n}$ is created. For CS this is simple, as $a_{ij} = d_i^t d_j$. The procedure for building the MVS matrix is described in Procedure 1, where lines 9 and 11 follow from expanding (7) for unit-length document vectors.
procedure BUILDMVSMATRIX(A)
Step 1: for $r \leftarrow 1 : c$ do
Step 2: $D_{S \setminus S_r} \leftarrow \sum_{d_i \notin S_r} d_i$
Step 3: $n_{S \setminus S_r} \leftarrow |S \setminus S_r|$
Step 4: end for
Step 5: for $i \leftarrow 1 : n$ do
Step 6: $r \leftarrow$ class of $d_i$
Step 7: for $j \leftarrow 1 : n$ do
Step 8: if $d_j \in S_r$ then
Step 9: $a_{ij} \leftarrow d_i^t d_j - d_i^t \frac{D_{S \setminus S_r}}{n_{S \setminus S_r}} - d_j^t \frac{D_{S \setminus S_r}}{n_{S \setminus S_r}} + 1$
Step 10: else
Step 11: $a_{ij} \leftarrow d_i^t d_j - d_i^t \frac{D_{S \setminus S_r} - d_j}{n_{S \setminus S_r} - 1} - d_j^t \frac{D_{S \setminus S_r} - d_j}{n_{S \setminus S_r} - 1} + 1$
Step 12: end if
Step 13: end for
Step 14: end for
Step 15: return $A = \{a_{ij}\}_{n \times n}$
First, the outer composite with respect to each class is determined. Then, for each row $a_i$ of A, $i = 1, \ldots, n$, if the pair of documents $d_i$ and $d_j$, $j = 1, \ldots, n$, are in the same class, $a_{ij}$ is calculated as in line 9. Otherwise, $d_j$ is assumed to be in $d_i$'s class, and $a_{ij}$ is calculated as in line 11. After matrix A is formed, Procedure 2 is used to get its validity score:
procedure GETVALIDITY(validity, A, percentage)
Step 1: for $r \leftarrow 1 : c$ do
Step 2: $q_r \leftarrow \lfloor percentage \times n_r \rfloor$
Step 3: if $q_r = 0$ then
Step 4: $q_r \leftarrow 1$
Step 5: end if
Step 6: end for
Step 7: for $i \leftarrow 1 : n$ do
Step 8: $\{a_{iv[1]}, \ldots, a_{iv[n]}\} \leftarrow$ Sort $\{a_{i1}, \ldots, a_{in}\}$
Step 9: s.t. $a_{iv[1]} \geq a_{iv[2]} \geq \ldots \geq a_{iv[n]}$, $\{v[1], \ldots, v[n]\} \leftarrow$ permute $\{1, \ldots, n\}$
Step 10: $r \leftarrow$ class of $d_i$
Step 11: $validity(d_i) \leftarrow \frac{|\{d_{v[1]}, \ldots, d_{v[q_r]}\} \cap S_r|}{q_r}$
Step 12: end for
Step 13: $validity \leftarrow \frac{1}{n} \sum_{i=1}^{n} validity(d_i)$
Step 14: return validity
For each document $d_i$ corresponding to row $a_i$ of A, we select the $q_r$ documents closest to $d_i$. The value of $q_r$ is chosen as a percentage of the size of the class r that contains $d_i$, where $percentage \in (0, 1]$. Then, the validity with respect to $d_i$ is calculated as the fraction of these $q_r$ documents that have the same class label as $d_i$, as in line 11. The final validity is determined by averaging over all the rows of A, as in line 13. The validity score is clearly bounded between 0 and 1. The higher the validity score of a similarity measure, the more suitable it should be for the clustering process.
IV. CONCLUSION
Clustering is one of the data mining and text mining techniques used to analyze a dataset by dividing it into meaningful groups. The objects in a given dataset can have certain relationships among them, and all clustering algorithms assume this before they are applied. The existing algorithms for text mining use a single viewpoint for measuring similarity between objects. Their drawback is that the clusters cannot exhibit the complete set of relationships among objects. To overcome this drawback, we propose a new similarity measure, the multi-viewpoint based similarity measure, so that the clusters reflect all relationships among objects. This approach uses many different viewpoints, taken from objects in other clusters, so that a more informative assessment of similarity can be achieved.
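As an illustration of the validity test in Section III, Procedures 1 and 2 can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' code: the function names, the toy documents, and the class labels are hypothetical, and the MVS matrix is built by averaging directly over the outside viewpoints as in (7), which is simpler but slower than using precomputed composite vectors. It assumes every class has at least two documents outside it.

```python
import math
from collections import Counter

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def build_cs_matrix(docs):
    """Cosine-similarity matrix for unit-length documents: a_ij = d_i^t d_j."""
    return [[dot(di, dj) for dj in docs] for di in docs]

def build_mvs_matrix(docs, labels):
    """Naive version of Procedure 1: average over viewpoints outside d_i's class."""
    n = len(docs)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = labels[i]
        for j in range(n):
            # Viewpoints are documents outside class r. If d_j is itself outside
            # class r, it is treated as a member of r and so is excluded here.
            views = [docs[h] for h in range(n) if labels[h] != r and h != j]
            if not views:
                raise ValueError("need at least two documents outside each class")
            total = sum(dot(sub(docs[i], v), sub(docs[j], v)) for v in views)
            A[i][j] = total / len(views)
    return A

def get_validity(A, labels, percentage):
    """Procedure 2: mean fraction of each row's top-q_r neighbours sharing its class."""
    n = len(A)
    sizes = Counter(labels)
    q = {r: max(1, math.floor(percentage * n_r)) for r, n_r in sizes.items()}
    total = 0.0
    for i in range(n):
        order = sorted(range(n), key=lambda j: A[i][j], reverse=True)
        q_r = q[labels[i]]
        same = sum(1 for j in order[:q_r] if labels[j] == labels[i])
        total += same / q_r
    return total / n

# Hypothetical toy corpus: unit vectors at the given angles, two classes of three.
angles = [0, 10, 20, 90, 100, 110]
docs = [[math.cos(math.radians(a)), math.sin(math.radians(a))] for a in angles]
labels = [0, 0, 0, 1, 1, 1]

A_cs = build_cs_matrix(docs)
A_mvs = build_mvs_matrix(docs, labels)
print("validity (CS): ", get_validity(A_cs, labels, 1.0))
print("validity (MVS):", get_validity(A_mvs, labels, 1.0))
```

On this well-separated toy corpus both measures place every document's nearest neighbours in its own class, so the example only demonstrates the mechanics of the test; it says nothing about how the two measures compare on real document collections.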
REFERENCES
[1] A. Ahmad and L. Dey, A Method to Compute Distance Between Two Categorical Values of Same Attribute in Unsupervised Learning for Categorical Data Set, Pattern Recognition Letters, vol. 28, no. 1, pp. 110-118, 2007.
[2] S. Zhong, Efficient Online Spherical K-means Clustering, Proc. IEEE Int'l Joint Conf. Neural Networks (IJCNN), pp. 3180-3185, 2005.
[3] I. S. Dhillon, S. Mallela, and D. S. Modha, Information-theoretic co-clustering, in KDD, 2003, pp. 89-98.
[4] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, Clustering on the unit hypersphere using von Mises-Fisher distributions, J. Mach. Learn. Res., vol. 6, pp. 1345-1382, Sep. 2005.
[5] S. Flesca, G. Manco, E. Masciari, L. Pontieri, and A. Pugliese, Fast detection of XML structural similarity, IEEE Trans. on Knowl. and Data Eng., vol. 17, no. 2, pp. 160-175, 2005.