Effective Classification of Text

A. Saritha, Student, SIT, JNTUH, Hyderabad
N. Naveen Kumar, Asst. Prof., SIT, JNTUH, Hyderabad
ABSTRACT: Text mining is the process of obtaining useful and interesting information from text. Huge amounts of text data are available in various formats, and most of it is unstructured. Text mining usually involves structuring the input text (parsing it and inserting the results into a database), deriving patterns from the structured data, and finally evaluating and interpreting the output. Several data mining techniques have been proposed for mining useful patterns in text documents. Mining techniques can use either terms or patterns (phrases). In theory, using patterns rather than terms in text mining may yield better results, but this has not been proved, and there is a need for effective ways of mining text. This paper presents a means of classifying text using a term-based approach.

Keywords: Classification, Naïve Bayes classifier, Preprocessing, Stopword Removal, Stemming, Unigram, N-gram models

1. INTRODUCTION

Corporate data is doubling in size. Text mining is an automated approach to utilizing that data for business needs: by mining text, the required knowledge can be retrieved, which is very useful. Knowledge from text usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. A typical application of text mining is to scan a given set of documents written in a natural language and either model them for predictive classification or populate a database or search index with the extracted information.

Text Mining and Text Analytics: Most business-relevant information originates in unstructured form. The term Text Analytics describes a set of linguistic, statistical, and machine learning techniques useful for business intelligence, solving business problems, exploratory data analysis, research, or investigation. These techniques and processes discover and present knowledge: facts, business rules, and relationships.

1.1 Data Mining versus Text Mining

Data mining and text mining are both semi-automated processes that seek useful patterns, but they differ in the nature of the data: structured versus unstructured. Structured data is the data present in databases, whereas unstructured data is the data present in documents such as Word documents, PDF files, text excerpts, XML files, and so on. Data mining is a process of extracting knowledge from structured data.
Text mining first imposes structure on the data and then mines the structured data.

1.2 Applications of Text Mining

The technology is now broadly applied to a wide variety of government, research, and business needs. Some applications and application areas of text mining are:

- Publishing and media
- Customer service and email support
- Spam filtering
- Measuring customer preferences by analyzing qualitative interviews
- Creating suggestions and recommendations
- Monitoring public opinion (for example, in blogs and review sites)
- Automatic labeling of documents in business libraries
- Political institutions, political analytics, public administration, and legal documents
- Fraud detection by investigating notifications of claims
- Fighting cyberbullying and cybercrime in IM and IRC chat
- Pharmaceutical and research companies and healthcare

1.3 System Architecture
Fig. 1.1: An example text mining system architecture

Starting with a collection of documents, a text mining tool retrieves a particular document and preprocesses it by checking format and character sets. It then goes through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system.

2. CLASSIFICATION

2.1 Introduction

Text classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents or defines classes for the classifier, and unsupervised document classification (also known as document clustering), where the classification must be done without any external reference; such systems have no predefined classes. There is also a third task, semi-supervised document classification, where some documents are labeled by the external mechanism (i.e., some documents are already classified, for better learning of the classifier).

2.1.1 Need for Automatic Text Classification

Classifying millions of text documents manually is an expensive and time-consuming task. Therefore, an automatic text classifier is constructed from pre-classified sample documents; its accuracy and time efficiency are much better than those of manual text classification.

2.1.2 Text Classification Framework
Fig. 2.1: Text classification framework

2.2 Preprocessing

2.2.1 Need for Preprocessing

The main objective of preprocessing is to obtain the key features or key terms from stored text documents and to enhance the relevance between word and document and between word and category. The preprocessing step is crucial in determining the quality of the next stage, the classification stage. It is important to select the significant keywords that carry the meaning and to discard the words that do not contribute to distinguishing between the documents. The preprocessing phase converts the original textual data into a data-mining-ready structure.
Fig. 2.2: Flow diagram of the preprocessing task

2.2.2 Stopword Removal

A stopword is a commonly occurring grammatical word that does not tell us anything about a document's content. Words such as 'a', 'an', 'the', and 'and' are stopwords. The process of stopword removal is to examine a document's content for stopwords and write any non-stopwords to a
temporary file for the document. We are then ready to perform stemming on that file. Stopwords do not carry significance in search queries and are usually filtered out, because they return a vast amount of unnecessary information. Stopword lists vary from system to system.

For example, consider the following paragraph:

"It is typical of the present leadership of the Union government that sensitive issues which impact the public mind are broached or attended to after the moment has passed, undermining the idea of communication between citizens and the government they have elected."

Stopword removal in the above paragraph means removing words such as 'a', 'an', 'the', and so on. From the above paragraph, the words 'it', 'is', 'of', 'the', 'that', 'which', 'are', 'or', 'to', 'has', and 'they have' will be removed. The text we get after removing stopwords is:

"typical present leadership Union government sensitive issues impact public mind broached attended moment passed, undermining idea communication between citizens government elected."

2.2.3 Stemming

In this process we find the root (stem) of a word. The purpose of this method is to remove various suffixes, to reduce the number of words, to obtain exactly matching stems, and to save memory space and time. For example, the words "producer", "producers", "produced", "production", and "producing" can all be stemmed to the word "produce". After performing stemming on the result of stopword removal above, we get the following text:

"typical present leadership Union government sensitive issues impact public mind broach attend moment pass, undermine idea communicat between citizen government elect."
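As a minimal sketch of these two steps (Python; the stopword list and suffix rules below are toy assumptions, not the actual resources used in this work — real systems use full stopword files and a Porter or Lovins stemmer):

    # Illustrative stopword removal and stemming; list and rules are toy assumptions.
    STOPWORDS = {"it", "is", "of", "the", "that", "which", "are", "or",
                 "to", "has", "they", "have", "a", "an", "and"}
    SUFFIXES = ("ing", "ed", "ers", "er", "ion", "s")   # naive stemming rules

    def remove_stopwords(text):
        # Keep only non-stopword tokens, mirroring the paragraph example above.
        return [w for w in text.lower().split() if w not in STOPWORDS]

    def stem(word):
        # Strip the first matching suffix; Porter/Lovins stemmers are far subtler.
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    tokens = remove_stopwords("It is typical of the present leadership")
    print([stem(w) for w in tokens])   # ['typical', 'present', 'leadership']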
2.2.4 Bag-of-Words Model

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval. In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier.

Example: The following models two text documents using bag-of-words.

D1: Ajay likes to play cricket. Anil likes cricket too.
D2: Ajay also likes to play basketball.

A dictionary is constructed from the words in these two documents as follows (it does not preserve the order of the words in the sentences):

{Ajay:1, play:2, likes:3, to:4, cricket:5, basketball:6, Anil:7, also:8, too:9}

The dictionary has 9 distinct words, so each document is represented by a 9-entry vector:

D1: {1, 1, 2, 1, 2, 0, 1, 0, 1}
D2: {1, 1, 1, 1, 0, 1, 0, 1, 0}

Each entry of a document vector represents the frequency of occurrence of the corresponding dictionary word in the respective document.
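A compact sketch of this construction (Python; the regex tokenizer is a simplifying assumption, and the dictionary order below follows first appearance rather than the numbering shown above):

    # Build a bag-of-words dictionary and per-document count vectors.
    import re
    from collections import Counter

    docs = ["Ajay likes to play cricket. Anil likes cricket too.",
            "Ajay also likes to play basketball."]

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    vocab = []                          # dictionary, in order of appearance
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocab:
                vocab.append(word)

    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocab])

    print(len(vocab))   # 9 distinct words
    print(vectors)      # one 9-entry frequency vector per document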
2.3 Feature Vector Generation

In pattern recognition and image processing, feature extraction is a special form of dimensionality reduction. When the input data to an algorithm is too large to be processed, the input data is transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction. A feature vector is an n-dimensional vector of numerical features that represents some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image; when representing texts, perhaps to term occurrence frequencies. The vector space associated with these vectors is often called the feature space. To reduce the dimensionality of the feature space, a number of dimensionality reduction techniques can be employed. If the features extracted are carefully chosen, it is expected that the feature set will extract the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Higher-level features can be obtained from already available features and added to the feature vector.
Feature construction is the application of a set of constructive operators to a set of existing features, resulting in the construction of new features.

Term Frequency and Inverse Document Frequency (TF-IDF): The tf-idf weight (term frequency - inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. For a given term w_j and document d_i, let n_ij be the number of occurrences of w_j in document d_i. Then:

Term frequency: TF_ij = n_ij / |d_i|, where |d_i| is the number of words in document d_i.

Inverse document frequency: IDF_j = log(n / n_j), where n is the number of documents and n_j is the number of documents that contain w_j.
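These two formulas transcribe directly into code. A minimal sketch (Python, assuming whitespace-tokenized documents; library implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization on top of this):

    import math

    # Toy corpus: each document is a list of words.
    docs = ["ajay likes to play cricket anil likes cricket too".split(),
            "ajay also likes to play basketball".split()]
    n = len(docs)                                    # number of documents

    def tf(word, doc):
        return doc.count(word) / len(doc)            # TF_ij = n_ij / |d_i|

    def idf(word):
        n_j = sum(1 for doc in docs if word in doc)  # documents containing w_j
        return math.log(n / n_j)                     # IDF_j = log(n / n_j)

    def tf_idf(word, doc):
        return tf(word, doc) * idf(word)

    print(tf_idf("cricket", docs[0]))   # > 0: frequent here, absent elsewhere
    print(tf_idf("likes", docs[0]))     # = 0: occurs in every document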
2.3.1 Unigram Model

The simplest model, the unigram model, treats words in isolation. If there are 16 words in an example text and the word X occurs twice, X has a probability of 2/16 = .125; a word Y that occurs only once has a probability of 1/16 = .0625. This kind of information can be used to judge the well-formedness of texts, analogous to, say, identifying some new text as being in the same language as a known corpus. The overall probability of the new text is calculated as a function of the individual probabilities of the words that occur in it. On this view, the likelihood of a text like "X Y Z", where X, Y, and Z are words that each have probability .125, is a function of the probabilities of its parts. If we assume the choice of each word is independent, the probability of the whole string is the product of the individual word probabilities, in this case .125 x .125 x .125 = .00195. In this model, the well-formedness of a text fragment is correlated with its overall probability: higher-probability text fragments are more well-formed than lower-probability ones. A major shortcoming of this model is that it makes no distinction among texts in terms of ordering; it cannot distinguish "a b" from "b a".

2.3.2 Bigram Model

A more complex model captures some of the ordering restrictions that occur in a language or text: bigrams. The basic idea behind higher-order n-gram models is to consider the probability of a word occurring as a function of its immediate context. In a bigram model, this context is the immediately preceding word:

p(w_1 w_2 ... w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_n-1)

We calculate the conditional probability in the usual fashion from counts:

p(w_i|w_i-1) = count(w_i-1 w_i) / count(w_i-1)

Calculating conditional probabilities is then a straightforward matter of division. For example, the conditional probability of Z given X might be p(Z|X) = count(X Z) / count(X) = 2/2 = 1, while the conditional probability of Z given Q is p(Z|Q) = count(Q Z) / count(Q) = 0/1 = 0. Using conditional probabilities thus captures the fact that the likelihood of Z varies with the preceding context: it is more likely after X than after Q. Different orderings are also distinguished in the bigram model; consider, for example, the difference between "X Z" and "Z X".

2.3.3 N-gram Model

N-gram models are not restricted to unigrams and bigrams; higher-order n-gram models are also used, and are characterized as we would expect. For example, a trigram model views a text w_1 w_2 ... w_n as the product of a series of conditional probabilities:

p(w_1 w_2 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_n-2 w_n-1)

One way to appreciate the success of n-gram language models is to use them to approximate text in a generative fashion: we can compute all the occurring n-grams over some text and then use those n-grams to generate new text.
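The unigram and bigram estimates above come straight from counts. A small sketch (Python; the one-line corpus is a toy assumption chosen to reproduce the p(Z|X) = 1 and p(Z|Q) = 0 example):

    from collections import Counter

    tokens = "x z y x z q y".split()               # toy corpus of 7 words
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))     # adjacent word pairs

    def p_unigram(word):
        return unigrams[word] / len(tokens)        # word in isolation

    def p_bigram(word, prev):
        # p(w_i | w_i-1) = count(w_i-1 w_i) / count(w_i-1)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p_unigram("x"))       # 2/7: x occurs twice in 7 tokens
    print(p_bigram("z", "x"))   # 2/2 = 1.0: z always follows x here
    print(p_bigram("z", "q"))   # 0/1 = 0.0: z never follows q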
2.4 Classification Techniques

There are many techniques used for text classification. Some of them are:

2.4.1 Nearest neighbour classifier
2.4.2 Bayesian classification
2.4.3 Support vector machine
2.4.4 Association-based classification
2.4.5 Term graph model
2.4.6 Decision tree induction
2.4.7 Centroid-based classification
2.4.8 Classification using neural networks

This paper presents classification of text using the Naïve Bayes classifier.

2.4.2 Bayesian Classification

The Naïve Bayes classifier is the simplest probabilistic classifier used to classify text documents. It makes the severe assumption that each feature word is independent of the other feature words in a document. The basic idea is to use the joint probabilities of words and categories to estimate the class of a given document. Given a document d_i, the probability of each class c_j is calculated as

P(c_j|d_i) = P(d_i|c_j) P(c_j) / P(d_i)

Since P(d_i) is the same for all classes, label(d_i), the class (or label) of d_i, can be determined by

label(d_i) = argmax_cj P(c_j|d_i) = argmax_cj P(d_i|c_j) P(c_j)

This technique classifies using probabilities, assuming independence among terms:

P(C|X_i, X_j, X_k) is proportional to P(C) P(X_i|C) P(X_j|C) P(X_k|C)
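To make the estimation concrete, here is a minimal sketch of a Naïve Bayes text classifier (Python; the two-class toy corpus and the add-one smoothing are assumptions for illustration, while the experiments below use Weka's NaiveBayes implementation):

    import math
    from collections import Counter, defaultdict

    # Toy labeled corpus: (tokens, class); illustrative, not the Reuters data.
    train = [("acquisition company shares deal".split(), "acq"),
             ("coffee crop export prices".split(), "coffee"),
             ("company merger deal shares".split(), "acq")]

    class_counts = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)              # class -> word frequencies
    for words, label in train:
        word_counts[label].update(words)
    vocab = {w for words, _ in train for w in words}

    def log_posterior(words, label):
        # log P(c_j) + sum_i log P(x_i | c_j), with add-one (Laplace) smoothing
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        return score

    doc = "coffee export deal".split()
    print(max(class_counts, key=lambda c: log_posterior(doc, c)))   # -> coffee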
3. RESULTS

3.1 Dataset

The data set considered in this work is the Reuters Corpus. In 2000, Reuters released a corpus of Reuters news stories for use in research and development of natural language processing, information retrieval, and machine learning systems. The Reuters Corpus is divided into a number of volumes; each volume is identified by a volume number and a description, e.g. Reuters Corpus Volume 1 (English Language, 1996-08-20 to 1997-08-19). The training data is classified into 5 categories with the labels acq, alum, barley, carcass, and coffee. In this work, a set of documents belonging to each class is considered: ten documents of each class, i.e., 50 documents in total, are used as the training data set. For testing, under the same class labels, five documents from each class, i.e., 25 documents in total, are used.

3.2 Work

Using a Weka converter, the documents are loaded into a single file. The following commands perform the various tasks in text classification using the Naïve Bayes classifier. Initially, using the class TextDirectoryLoader, the set of documents present in the folder traindata is stored in a single traininput.arff file. This class is defined in the Java package weka.core.converters. The commands are as follows:

java weka.core.converters.TextDirectoryLoader -dir d:\traindata > d:\traininput.arff
java weka.core.converters.TextDirectoryLoader -dir d:\testdata > d:\testinput.arff

Then we can remove stopwords, apply stemming, and build a word vector using the filter StringToWordVector, performing batch processing as follows:
java weka.filters.unsupervised.attribute.StringToWordVector -b -i d:\traininput.arff -o d:\trainoutput.arff -r d:\testinput.arff -s d:\testoutput.arff -R first-last -W 1000 -prune-rate -1.0 -T -I -N 0 -S -stemmer weka.core.stemmers.LovinsStemmer -M 1 -stopwords D:\as\stopwordsnew.txt -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r \\t.,;:\\\'\\\"()?!-<>\\n\""

Now apply the Naïve Bayes classifier on the training file to build a model.
This classifier model can then be used to classify the test data using Weka as follows.
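(The extracted text omits the actual invocation; a typical command line for this step, reconstructed here as an assumption consistent with the files created above, would be:)

java weka.classifiers.bayes.NaiveBayes -t d:\trainoutput.arff -T d:\testoutput.arff

Here -t names the training ARFF file and -T the test ARFF file; Weka then reports evaluation statistics for both.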
The Naïve Bayes classifier classified all the instances of the training data correctly except one; on the test data, 64% of the instances were classified correctly and 36% incorrectly. In the same way, we can apply an n-gram tokenizer to the train and test data and then classify the data. The n-gram tokenizer is applied as follows:

java weka.filters.unsupervised.attribute.StringToWordVector -b -i d:\traininput.arff -o d:\trainoutput.arff -r d:\testinput.arff -s d:\testoutput.arff -R first-last -W 1000 -prune-rate -1.0 -T -I -N 0 -S -stemmer weka.core.stemmers.LovinsStemmer -M 1 -stopwords D:\as\stopwordsnew.txt -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \" \\r \\t.,;:\\\'\\\"()?!-<>\\n\" -max 3 -min 1"

Now apply the Naïve Bayes classifier on the result.

4. CONCLUSION AND FUTURE WORK

Text classification is an important application area in text mining because classifying millions of text documents manually is an expensive and time-consuming task. Therefore, an automatic text classifier is constructed from pre-classified sample documents; its accuracy and time efficiency are much better than those of manual text classification. If the input to the classifier contains less noisy data, we obtain better results, so efficient preprocessing algorithms must be chosen when mining the text. The test data should also be preprocessed before classifying it. Text can be classified better by identifying patterns: once patterns are identified, we can classify a given text or set of documents efficiently. Identifying efficient patterns plays a major role in text classification; when efficient classification of text is needed, various methods can be implemented to identify better patterns for classifying text.