Case Study On Text Mining
ABSTRACT:
Data mining is the process of retrieving meaningful information from large volumes of
data. The data may take the form of text, audio, video or images. Obtaining information from
such data is not an easy task and requires a range of techniques. Text mining is one of
them.
INTRODUCTION
The volume of data in the computing world grows at an exponential rate. Every day,
millions of megabytes are added to the existing data. Almost all institutions,
organizations and commercial enterprises store their data in digital form. A
substantial amount of text circulates on the Internet in the form of digital libraries, repositories
and other textual sources such as blogs, social networks and e-mails. Identifying the trends
and patterns needed to extract valuable knowledge from this large volume of data is a very
difficult task.
Traditional data mining tools are poorly suited to text data, because extracting
information from unstructured text takes considerable time and effort. Text mining is the
process of extracting meaningful and interesting patterns from textual data sources in order
to discover knowledge. It is a multidisciplinary field that draws on data mining, information
retrieval, machine learning, statistics and computational linguistics. Different text mining
techniques, such as summarization, classification and clustering, can be applied to extract
knowledge. Text mining processes natural-language text that is stored in semi-structured or
unstructured formats. Its techniques are widely applied in industry, academia, web
applications and other fields. Application areas such as search engines, e-mail filtering,
product recommendation analysis, fraud detection and social media analysis use text mining
for opinion mining, feature extraction, sentiment analysis, predictive analysis and trend
analysis.
The generic text mining process consists of the following steps:
❖ Unstructured data is collected from different sources, available in different file formats such
as plain text, web pages, PDF files, etc.
❖ Pre-processing and cleaning operations are performed to detect and remove anomalies.
Cleaning ensures that the real essence of the available text is captured. It removes stop
words, applies stemming (the process of identifying the root of certain words) and indexes
the data.
❖ Processing and controlling operations are applied to audit and further clean the data set by
automatic processing.
❖ Pattern analysis is implemented by a Management Information System (MIS).
❖ The information processed in the above steps is used to extract valuable and relevant
information for effective and timely decision making and trend analysis.
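The pre-processing step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list, the crude suffix-stripping `stem` function and the sample documents are all assumptions made for the example, and a real system would use a proper stemmer and a much larger stop-word list.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    """Very crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(documents):
    """Map each stemmed term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Made-up sample documents.
documents = {
    "d1": "Text mining extracts patterns from documents.",
    "d2": "Clustering groups similar documents.",
}
index = build_index(documents)
print(index["document"])  # ['d1', 'd2']
```

The resulting inverted index is what later steps, such as retrieval or pattern analysis, would query instead of the raw text.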
Extracting information from different documents is a tedious and difficult task. Selecting a
suitable text mining technique reduces the time needed to find the relevant patterns for analysis and
decision making. The main goal of this paper is to analyze different text mining techniques that help
to perform text analytics effectively and efficiently on large amounts of data. The issues that arise
during the text mining process are also identified.
DIFFERENCE BETWEEN DATA MINING AND TEXT MINING :
DATA MINING TEXT MINING
2. Information Retrieval
Information Retrieval (IR) refers to the process of extracting relevant and associated
patterns based on a specific set of words or phrases. IR systems use different algorithms
to track and monitor user behavior and discover relevant data accordingly. The Google and
Yahoo search engines are two of the best-known IR systems.
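As a rough illustration of retrieving documents for a set of query words, the sketch below ranks a few made-up documents with a simple TF-IDF score. TF-IDF is a common term-weighting scheme chosen here as an assumption; the text does not say which algorithms real search engines use, and the sketch does not model user behavior at all.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # The +1 keeps terms that occur in every document from scoring zero.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up sample collection.
documents = {
    "d1": "text mining finds patterns in text",
    "d2": "search engines retrieve relevant web pages",
    "d3": "clustering groups documents",
}
ranking = rank("text patterns", documents)
print(ranking[0][0])  # d1
```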
3. Categorization
This text mining technique is a form of “supervised” learning in which natural language
texts are assigned to a predefined set of topics depending on their content. Categorization, which
relies on NLP (natural language processing), is thus a process of gathering text documents and processing and
analyzing them to uncover the right indexes for each document. The co-referencing method is commonly
used as a part of NLP to extract relevant synonyms and abbreviations from textual data. Today, NLP has
become an automated process used in a host of contexts, ranging from personalized advertisement delivery
to spam filtering, categorizing web pages under hierarchical definitions, and much more.
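To make the idea of supervised categorization concrete, here is a minimal sketch of a multinomial Naive Bayes classifier applied to spam filtering. Naive Bayes is one common choice and is used here as an assumption, not as the method the text prescribes. The training sentences and the `spam`/`ham` labels are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocabulary = set()

    def train(self, labelled_documents):
        for text, label in labelled_documents:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # Log prior plus the log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training data: predefined topics are "spam" and "ham".
classifier = NaiveBayesClassifier()
classifier.train([
    ("win a free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
])
print(classifier.predict("free prize offer"))        # spam
print(classifier.predict("monday project meeting"))  # ham
```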
4. Clustering
Clustering is one of the most critical text mining techniques. It seeks to identify intrinsic
structures in textual information and organize them into relevant subgroups or ‘clusters’ for
further analysis. A significant challenge in the clustering process is to form meaningful
clusters from the unlabeled textual data without having any prior information on them.
Cluster analysis is a standard text mining tool that assists in data distribution or acts as a pre-
processing step for other text mining algorithms running on detected clusters.
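The sketch below shows one very simple way to group unlabeled texts: a greedy single-pass ("leader") clustering based on Jaccard word overlap. This is an assumption chosen for brevity, not the algorithm the text has in mind. The threshold value and the sample headlines are likewise made up, and practical systems usually rely on methods such as k-means or hierarchical clustering over weighted term vectors.

```python
import re

def tokenize(text):
    """Return the set of distinct lower-case words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of distinct words that two documents have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(documents, threshold=0.2):
    """Assign each document to the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of document indices; the first is its leader
    token_sets = [tokenize(text) for text in documents]
    for i, tokens in enumerate(token_sets):
        for members in clusters:
            if jaccard(tokens, token_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster is close enough, so this document starts a new one.
            clusters.append([i])
    return clusters

# Made-up, unlabeled sample documents.
documents = [
    "stock market prices fell sharply",
    "football team wins the match",
    "stock prices rise as market recovers",
    "the team lost the football match",
]
print(cluster(documents))  # [[0, 2], [1, 3]]
```

No labels are supplied at any point; the two groups emerge only from the word overlap between documents.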
5. Summarisation
Text summarisation refers to the process of automatically generating a compressed
version of a specific text that holds valuable information for the end-user. The aim of this
technique is to browse through multiple text sources and produce summaries that contain a
considerable proportion of the information in a concise format, keeping the overall meaning and
intent of the original documents essentially the same. Text summarisation integrates and
combines various methods that are also employed in text categorization, such as decision trees,
neural networks, regression models, and swarm intelligence.
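A minimal extractive summariser can illustrate the idea of compressing a text while keeping its main content. The sketch below scores each sentence by the average frequency of its words and keeps the highest-scoring ones. This frequency heuristic is an assumption chosen for simplicity and is not one of the learning-based methods listed above. The stop-word list and sample text are also made up.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "from", "in", "is", "it", "many",
              "of", "on", "the", "to", "used"}

def content_words(text):
    """Lower-case alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in chosen)

text = ("Text mining extracts knowledge from text. "
        "It is used in many industries. "
        "Text mining relies on language processing.")
print(summarize(text, max_sentences=1))
```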
Natural Language Processing (NLP) addresses a challenging problem in the field of text mining
and uses artificial intelligence to tackle it. NLP is the study of human language with the aim
of enabling computers to understand natural languages much as humans do. With NLP, a
computer system is able to resolve text mining ambiguities such as homonymy, polysemy,
synonymy and hyponymy. NLP also recognizes similar concepts, even if they have been
expressed in very different ways. For example, the same word may be spelt differently
(hemophilia/haemophilia, tumor/tumour), the same word may be realized differently in
different contexts (tumor/tumors, suffers/suffered), and the same concept may be expressed by
different words entirely (Tylenol / Acetaminophen, heart attack / myocardial infarction).
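The three kinds of variation in the examples above can be illustrated with a small normalisation sketch. The lookup tables below are hypothetical and contain only the pairs mentioned in the text. A real NLP system would derive such mappings from lexical resources or learned models rather than from hand-written dictionaries.

```python
import re

# Hypothetical lookup tables built only from the examples given in the text.
SPELLING_VARIANTS = {"haemophilia": "hemophilia", "tumour": "tumor"}
INFLECTIONS = {"tumors": "tumor", "tumours": "tumour",
               "suffers": "suffer", "suffered": "suffer"}
SYNONYMS = {"tylenol": "acetaminophen", "heart attack": "myocardial infarction"}

def normalize(text):
    """Map spelling variants, inflected forms and synonyms to one canonical form."""
    text = text.lower()
    # Replace synonyms first, because some of them span more than one word.
    for phrase, canonical in SYNONYMS.items():
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", canonical, text)
    tokens = re.findall(r"[a-z]+", text)
    tokens = [INFLECTIONS.get(t, t) for t in tokens]        # tumours -> tumour
    tokens = [SPELLING_VARIANTS.get(t, t) for t in tokens]  # tumour -> tumor
    return " ".join(tokens)

a = normalize("Patient suffered a heart attack")
b = normalize("Patient suffers a myocardial infarction")
print(a == b)  # True
```

Once both sentences are reduced to the same canonical string, later steps can treat them as statements about the same concept.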
Even though text mining may seem like a complicated matter, it can actually be quite simple
to get started:
❖ The first step is gathering your data. Let’s say you want to analyze conversations
with users through your company’s Intercom live chat. The first thing you’ll need to do is
generate a document containing this data.
❖ The second step is preparing your data. Text mining systems use several NLP
techniques, such as tokenization, parsing, lemmatization, stemming and stop-word
removal, to build the inputs of your machine learning model.
EVALUATION:
It is possible to evaluate text extractors by using the same performance metrics as text
classification: accuracy, precision, recall and F1 score. However, these metrics only consider
exact matches as true positives, leaving partial matches aside.
Let’s look at an example:
❖ Suppose you create an address extractor. This could be an example of an exact match
(true positive for the tag Address): ‘6818 Eget St., Tacoma’. However, the output
could also be ‘6818 Eget St.’. In this case, even though it is a partial match, it should
not be considered as a false positive for the tag Address.
❖ To include these partial matches, you should use a performance metric known as
ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is a family of
metrics that can be used to better evaluate the performance of text extractors than
traditional metrics such as accuracy or F1. How do they work? They calculate the
lengths and number of sequences overlapping between the original text and the
extraction (extracted text).
❖ The ROUGE metrics (the parameters you would use to compare overlapping between
the two texts mentioned above) need to be defined manually. That way, you can
define ROUGE-n metrics (where n is the length of the units), or a ROUGE-L metric if
your intent is to compare the longest common subsequence.
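The address example above can be scored with a small sketch of ROUGE-n and ROUGE-L recall. This is a simplified illustration under stated assumptions: tokens are lower-cased words with punctuation dropped, only recall is computed, and the function names are invented for the example rather than taken from any ROUGE library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case word tokens with punctuation dropped (an assumption of this sketch)."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Count the n-grams (tuples of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = ngrams(tokenize(reference), n)
    cand = ngrams(tokenize(candidate), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((ref & cand).values()) / total

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """Longest common subsequence length divided by the reference length."""
    ref, cand = tokenize(reference), tokenize(candidate)
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

expected = "6818 Eget St., Tacoma"   # the full address from the example
extracted = "6818 Eget St."          # the partial match
print(rouge_n_recall(expected, extracted, n=1))  # 0.75
print(rouge_l_recall(expected, extracted))       # 0.75
```

An exact-match metric would count the partial extraction as a miss, whereas here it receives credit for three of the four reference tokens.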
APPLICATIONS:
❖ Text categorization into specific domains, for example spam versus non-spam e-mails, or
detecting sexually explicit content.
❖ Text clustering to automatically organize a set of documents. Let’s say you have a
folder of 200,000 documents in .pdf format: organizing them by hand would be impractical.
❖ Sentiment analysis to identify and extract subjective information in documents, for
example to detect what your customers are saying about your company when they use social
media.
❖ Part-of-speech (POS) tagging is another subtask in NLP. In this task
you try to associate a part of speech, such as noun, adjective or verb, with each word in a
text, based on context and relationship to adjacent words.
Coreference resolution is a further example: the sentence “Obama told Joe Biden that he should
consider running for president” contains a coreference (Joe Biden and he). This task
is considered a stepping stone towards more complex tasks such as question answering
and summarization.
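Of the applications listed above, sentiment analysis is perhaps the easiest to sketch. The example below counts positive and negative words and flips the polarity after a negation. The word lists are hypothetical toy lexicons invented for this illustration, and the approach is far simpler than the trained models used in practice.

```python
import re

# Hypothetical toy lexicons (assumptions made for this sketch only).
POSITIVE = {"good", "great", "love", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(text):
    """Sum word polarities, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

def label(text):
    """Turn the numeric score into a positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("I love the new app, support was great"))           # positive
print(label("The service was not good and delivery was slow"))  # negative
```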
CONCLUSION:
Today a huge volume of digital data is available in the computing world, most of it
in textual form. Text mining techniques are applied to extract information from these
unstructured documents. This paper presents a brief overview of text mining, data mining
and related terms. Text mining and data mining are both applied for information extraction
using a number of techniques. The major difference between the two processes lies in the
source of data to which the mining techniques are applied: data mining is applied to
structured data, while text mining is performed on unstructured or semi-structured data.
Text mining techniques include information extraction, information retrieval, document
classification and clustering. Natural language processing and machine learning algorithms
are also applied to mine text documents. Information, which is generated by extracting it
from data, plays a vital role in an organization’s success. The major text mining application
areas are fraud detection, spam detection, customer relationship management, and research and
development. Text mining still faces issues such as the ambiguities present in text.
A great deal of research has been done, but text mining is still immature. Processing
natural language text is very difficult, and many research opportunities are available in this
area.