Automatic Document Clustering and Knowledge Discovery
A. CLUSTERING:
Document or text clustering is an unsupervised technique in which no input-output patterns are predefined. Clustering is done efficiently when the documents within a cluster are more similar to one another than to documents in other clusters, i.e. intra-cluster similarity exceeds inter-cluster similarity. Clustering differs from categorization in that documents are clustered on the fly instead of relying on trained datasets.
(where ti = terms and di = documents, i = 1, 2, 3, ...)
A weighting scheme for document weights is defined by considering the frequency of terms within a document. If documents contain the same keywords, they are similar. Therefore we can use the term frequency tf(i, j), i.e. the number of times a term i occurs in a document j. The term frequency is normalized with respect to the maximal frequency of all terms occurring in the document [1]:

tf(i, j) = freq(i, j) / max_k {freq(k, j)}

We should also consider how frequent a term is in the whole document collection. The document frequency df_i of a term is the number of documents in which term i occurs. If D is the number of documents in the database, then the inverse document frequency is:

idf(i) = log(D / df_i)

The term weight is then given by:

w_i = tf_i * log(D / df_i)
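For illustration, the weighting scheme above can be sketched in Python as follows. This is a minimal sketch of the formulas just given; the function name and the toy tokenized documents are ours, not part of the original system:

import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_i = tf(i, j) * log(D / df_i) for each term in each document,
    with tf normalized by the most frequent term in that document."""
    D = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values()) if freq else 1
        weights.append({
            term: (f / max_freq) * math.log(D / df[term])
            for term, f in freq.items()
        })
    return weights

# toy usage: three tokenized "documents"
docs = [["data", "mining", "text"], ["data", "clustering"], ["rules", "mining", "mining"]]
print(tfidf_weights(docs))

Note that a term occurring in every document gets weight 0, since log(D / df_i) = log(1) = 0, which is consistent with the formula above.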
Now, the cosine similarity (the cosine of the angle between two vectors) has to be calculated by the following formula. The cosine similarity between two vectors x and y is:

cos(x, y) = (x . y) / (|x| |y|) = Σ_{i=1}^{n} x_i y_i / ( sqrt(Σ_{i=1}^{n} x_i^2) * sqrt(Σ_{i=1}^{n} y_i^2) )

Hence, for any two given documents dj and dk, their similarity is:

sim(dj, dk) = (dj . dk) / (|dj| |dk|) = Σ_{i=1}^{n} w_{i,j} w_{i,k} / ( sqrt(Σ_{i=1}^{n} w_{i,j}^2) * sqrt(Σ_{i=1}^{n} w_{i,k}^2) )

It has the following properties:
a) If two vectors coincide completely, their similarity should be maximal, i.e. 1.
b) If two vectors have no keywords in common, the similarity should be minimal, i.e. 0.
c) In all other cases the similarity should be between 0 and 1.
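Continuing the sketch above, the document similarity can be computed from the per-document weight dictionaries; the sparse dict representation is our assumption, not the paper's:

import math

def cosine_similarity(wa, wb):
    """Cosine similarity between two sparse weight vectors (dicts: term -> weight)."""
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    na = math.sqrt(sum(w * w for w in wa.values()))
    nb = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(weights):
    """n x n matrix of sim(dj, dk); the diagonal is 1 for nonzero vectors."""
    n = len(weights)
    return [[cosine_similarity(weights[j], weights[k]) for k in range(n)]
            for j in range(n)]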
d) Proposed Algorithm:
Input: Document set D = {d1, d2, d3, ..., dn}
Output: Number of clusters, and the set of clusters C along with their associated documents.
1. Preprocessing.
2. Compute the cosine similarity matrix CS(i x j):

| 1       d(1,2)  d(1,3)  ...  d(1,n) |
| d(2,1)  1       d(2,3)  ...  d(2,n) |
| d(3,1)  d(3,2)  1       ...  d(3,n) |
| ...     ...     ...     ...  ...    |
| d(n,1)  d(n,2)  d(n,3)  ...  1      |

3. /* clustering */
i. Start with a document d[i, j] in the matrix with CS value 1, where i = j.
ii. Take the mean of its row or column; the obtained value will be the threshold.
iii. Search the row or column for documents whose value satisfies the above threshold.
iv. Cluster the documents that satisfy the threshold into the category Ci = {d1, d2, ..., dm | where m is any number between 1 and n}.
v. For all the documents clustered in the first category, change their CS value to some other value so that these documents are not considered further at any stage, to avoid redundancy. Only documents whose CS value is still 1 are considered for the formation of further clusters.
vi. Repeat steps i to v to make more clusters.
4. /* generation of the number of clusters */
i. Initially, count = 0.
ii. After the formation of each cluster, count is incremented by 1.
iii. Finally, display count.
5. Display the output in the form c1 = {d1, d2, d6, d8, ...}, and so on, as sketched below.
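Read as pseudocode, the clustering phase could be sketched in Python roughly as follows. This is a non-authoritative reading of steps i-vi: the choice of seed document and the interpretation of "changed to some other value" as removal from further consideration are our assumptions:

def cluster(cs):
    """Threshold-based clustering over a cosine-similarity matrix cs (list of lists).
    Each still-unclustered document seeds a cluster containing every unclustered
    document whose similarity to it meets the mean of the seed's row."""
    n = len(cs)
    unclustered = set(range(n))
    clusters = []
    while unclustered:
        seed = min(unclustered)                 # step i: a document with CS[seed][seed] == 1
        threshold = sum(cs[seed]) / n           # step ii: mean of the row as threshold
        members = [j for j in unclustered
                   if cs[seed][j] >= threshold] # steps iii-iv: gather documents over threshold
        members = members or [seed]             # guard: the seed always joins its own cluster
        clusters.append(members)
        unclustered -= set(members)             # step v: clustered docs are not reconsidered
    return clusters                             # step 4: len(clusters) is the cluster count

Chained with the earlier sketches, this would be invoked as clusters = cluster(similarity_matrix(tfidf_weights(docs))).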
B. KNOWLEDGE DISCOVERY:
Knowledge discovery from a textual database generally refers to the process of extracting interesting information from unstructured text documents. Here, knowledge is discovered in the form of association rules. These rules can be extracted from texts using the GARW (Generation of Association Rules based on a Weighting scheme) algorithm. The rules give us an idea of what the documents of each cluster are about, and help us make the correct choice of the topics in which we are interested.

This phase describes a way of discovering knowledge from a collection of documents by automatically extracting association rules from them. Given a set of keywords A = {w1, w2, ..., wn} and a collection of indexed documents D = {d1, d2, ..., dm}, where each document di is a set of keywords such that di ⊆ A, let Wi be a set of keywords. A document di is said to contain Wi if and only if Wi ⊆ di. An association rule is an implication of the form Wi ⇒ Wj, where Wi ⊂ A, Wj ⊂ A and Wi ∩ Wj = ∅. There are two important basic measures for association rules: support (s) and confidence (c). The rule Wi ⇒ Wj has support s in the collection of documents D if s% of the documents in D contain Wi ∪ Wj. The support is calculated by the following formula:

support(Wi, Wj) = support count of (Wi ∪ Wj) / total number of documents in D

and the confidence of a rule is defined as:

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). [2][3]

An association rule-mining problem is broken into two steps:
1) Generate all the keyword combinations (keywordsets) whose support is greater than the user-specified minimum support (called minsup). Such sets are called the frequent keywordsets.
2) Use the identified frequent keywordsets to generate the rules that satisfy a user-specified minimum confidence (called minconf).
The frequent keywordset generation requires more effort; the rule generation is straightforward.
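As a concrete illustration of these two measures, here is a small Python sketch; the toy document sets and function names are ours:

def support(keywordset, docs):
    """Fraction of documents (each a set of keywords) containing every keyword in keywordset."""
    return sum(1 for d in docs if keywordset <= d) / len(docs)

def confidence(wi, wj, docs):
    """conf(Wi => Wj) = supp(Wi u Wj) / supp(Wi)."""
    s_wi = support(wi, docs)
    return support(wi | wj, docs) / s_wi if s_wi else 0.0

docs = [{"stock", "market", "trade"}, {"stock", "price"}, {"market", "trade"}]
print(support({"stock"}, docs))                # 2/3: "stock" occurs in 2 of 3 documents
print(confidence({"stock"}, {"price"}, docs))  # (1/3) / (2/3) = 0.5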
We have an algorithm for Generating Association Rules based on a Weighting scheme (GARW). The GARW algorithm is as follows:
1. Let N be the number of keywords that satisfy the threshold weight value.
2. Store the N keywords in a hashmap along with their frequencies in all documents and their TF-IDF weight values.
3. Scan the hashmap and find all keywords that satisfy the threshold minimum support.
4. Generate candidate keywordsets from the large frequent keywordsets.
5. Compare the frequencies of the candidate keywordsets with the minimum support.
6. Generate from the candidate keywordsets the association rules that satisfy the threshold minimum confidence.
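A simplified sketch of this pipeline, under the weighting scheme above, might look as follows. This is our interpretation, not the authors' implementation: in particular, it enumerates only keyword pairs as candidates, whereas GARW's candidate generation from large frequent keywordsets is not fully specified here, and it assumes minsup > 0:

from itertools import combinations

def garw_rules(weights, docs, min_weight, minsup, minconf):
    """GARW-style rule generation sketch: keep only keywords whose TF-IDF weight
    passes min_weight in some document, then mine pairwise rules meeting
    minimum support and minimum confidence."""
    def supp(ks, ds):
        # fraction of documents containing every keyword in ks
        return sum(1 for d in ds if ks <= d) / len(ds)

    # steps 1-2: keywords that satisfy the threshold weight in some document
    kept = {t for w in weights for t, v in w.items() if v >= min_weight}
    filtered = [set(d) & kept for d in docs]
    # step 3: frequent single keywords
    frequent = [t for t in sorted(kept) if supp({t}, filtered) >= minsup]
    rules = []
    # steps 4-6: candidate pairs checked against minsup, then minconf
    for a, b in combinations(frequent, 2):
        if supp({a, b}, filtered) >= minsup:
            for x, y in ((a, b), (b, a)):
                conf = supp({x, y}, filtered) / supp({x}, filtered)
                if conf >= minconf:
                    rules.append((x, y, conf))
    return rules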
The extracted association rules can be reviewed in textual format or tables. They contain important features and describe the informative news included in the document collection.

The GARW algorithm reduces the execution time in comparison to the Apriori algorithm because it does not make multiple scans over the original documents like Apriori; it scans only the file or hashmap that contains all the keywords satisfying the threshold weight value, together with their frequencies in each document [8]. An Apriori-based system generates all frequent keywordsets from all keywords in the documents, important and unimportant alike, which leads to the extraction of both interesting and uninteresting rules. In contrast, the system based on the GARW algorithm generates its frequent keywordsets from the mostly important keywords in the documents, as selected by the weighting scheme. The weighting scheme thus plays an important role in selecting important keywords for association rule generation, which leads to the extraction of more interesting rules in a shorter time.
IV. EXPERIMENTAL RESULTS:
For experimental purposes we used the 20 Newsgroups dataset; from it we retrieved some documents and used them for our experiments. For analysis of the results, we applied k-means and our proposed algorithm to this dataset and made the following observations.

Phase 1: Clustering
a) K-means: If we give k = 10, the output consists of 10 clusters, whereas the maximum number of clusters is only 4 (according to the documents taken for the experiment); similarly, if we give k = 15, 15 clusters will be obtained. Some of the clusters would be singletons (i.e. clusters with just one document) as well. Therefore, both the time required to make so many clusters and the memory required to store them are higher.
b) Proposed algorithm: Here there is no need to specify k (the number of clusters), and we get 4 clusters from the input text dataset. Therefore, both the time required to make the clusters and the memory required to store them are lower.

(Figure: documents per cluster for k-means vs. the proposed algorithm; X axis: cluster number, Y axis: number of documents.)

(Figure: cluster generation time; X axis: algorithm type (k-means, proposed), Y axis: time in seconds.)

Phase 2: Knowledge Discovery
Here we used the GARW algorithm for the generation of association rules. In this algorithm, the keywords that do not satisfy the threshold weight are removed first; thus only important keywords remain, and hence we obtain important rules. However, if the Apriori algorithm is used for rule generation, it yields all rules, important and unimportant, thereby consuming time and memory as well. So, on each cluster obtained in Phase 1, we applied both algorithms for analysis and made the following observation.

(Figure: rules generated per cluster; X axis: cluster number, Y axis: number of rules generated.)
V. CONCLUSION:
Our proposed system automatically groups an unknown text dataset into an optimal number of clusters, as it does not take the number of clusters as input. It also generates association rules for each cluster, thereby discovering the knowledge, or topic, of the documents in each cluster. This gives the user an idea of what each cluster contains, so that a user interested in a particular topic can obtain detailed information about that cluster without examining the other clusters, reducing manual effort and time.
REFERENCES
[1] Madhura Phatak, Ranjana Agrawal, "A Novel Algorithm for Automatic Document Clustering," 2013 3rd IEEE International Advance Computing Conference (IACC).
[2] Anjali Ganesh Jivani, "The Novel k Nearest Neighbor Algorithm," 2013 International Conference on Computer Communication and Informatics (ICCCI-2013).
[3] Vaishali Bhujade, N. J. Janwe, Chhaya Meshram, "Discriminative Features Selection in Text Mining Using TF-IDF Scheme," International Journal of Computer Trends and Technology, July-August 2011.
[4] Manjot Kaur, Navjot Kaur, "Web Document Clustering Approaches Using K-Means Algorithm," International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 5, May 2013, ISSN: 2277 128X.
[5] Hongwei Yang, "A Document Clustering Algorithm for Web Search Engine Retrieval System," 2010 International Conference on e-Education, e-Business, e-Management and e-Learning.
[6] http://pyevolve.sourceforge.net/wordpress/?p=97
[7] http://people.csail.mit.edu/jrennie/20Newsgroup.
[8] Vaishali Bhujade, N. J. Janwe, "Knowledge Discovery in Text Mining Using Association Rules Extraction," 2011 IEEE Computer Society.
[9] Hany Mahgoub, Dietmar Rösner, Nabil Ismail, Fawzy Torkey, "A Text Mining Technique Using Association Rules Extraction," International Journal of Information and Mathematical Sciences 4:1, 2008.
[10] Varun Kumar, Anupama Chadha, "Mining Association Rules in Student's Assessment Data," IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No 3, September 2012.
[11] Shady Shehata, "A WordNet-based Semantic Model for Enhancing Text Clustering," 2009 IEEE International Conference on Data Mining Workshops.
[12] Mary Posonia A, V. L. Jyothi, "Structural-based Clustering Technique of XML Documents," 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2013).