Case Study On Text Mining

This document discusses text mining techniques and their applications. It begins with an introduction to text mining, explaining that it is used to extract meaningful patterns and knowledge from large amounts of unstructured text data. It then describes several common text mining techniques like information extraction, categorization, clustering, and summarization. The document also discusses some of the key differences between traditional data mining and text mining. Finally, it provides more detailed explanations of several text mining techniques like information extraction, information retrieval, categorization, clustering, and summarization.

Uploaded by

Shanthi Ganesan
Copyright
© All Rights Reserved

CASE STUDY ON TEXT MINING: APPLICATION, ISSUES AND CHALLENGES

ABSTRACT:
Data mining is the process of retrieving meaningful information from an ocean of
data. The data may take the form of text, audio, video and images. Obtaining information from
such data is not an easy task, and different techniques are required to extract it. Text
mining is one of them.

Text mining is the process of extracting information, patterns or knowledge from
text documents available from different sources. Every day, millions of bytes of data are
added to the existing data. Most of this data is stored in text documents, which are unstructured and
cannot be processed directly to extract useful information. Techniques such
as classification, clustering and information extraction are therefore applied for this purpose. A
number of text categorization techniques have been developed; some of them are
supervised and others are unsupervised. This paper
focuses on text mining, the different text mining techniques and their applications.

INTRODUCTION

The volume of data in the computing world grows at an exponential rate. Every
day, millions of megabytes of data are added to the existing data. Almost all types of institutions,
organizations and commercial industries store their data in electronic form. A
substantial amount of text circulates on the Internet in the form of digital libraries, repositories
and other textual information such as blogs, social networks and e-mails. Determining the
relevant trends and patterns and extracting valuable
knowledge from this large volume of data is a very difficult task.

Traditional data mining tools cannot handle text data directly, because extracting
information from it takes considerable time and effort. Text mining is the process of
extracting meaningful and interesting patterns in order to discover knowledge from textual
data sources. It is a multidisciplinary field
that draws on data mining, information retrieval, machine learning, statistics and
computational linguistics. Different text mining techniques such as summarization,
classification and clustering can be applied to extract knowledge. Text mining processes
natural-language text that is stored in semi-structured and unstructured formats.
Text mining techniques are applied continuously in industry, academia, web
applications, the Internet and other fields. Application areas such as search engines, e-mail filtering,
product suggestion analysis, fraud detection and social media analysis use text mining
for opinion mining, feature extraction, sentiment analysis, predictive analysis and
trend analysis.
The generic text mining process performs the following steps:
❖ Unstructured data is collected from different sources, available in different file formats such
as plain text, web pages, PDF files, etc.
❖ Pre-processing and cleaning operations are performed to detect and remove anomalies.
The cleaning process makes sure that the real essence of the available text is captured; it
removes stop words, applies stemming (the process of identifying the root of certain
words) and indexes the data.
❖ Processing and controlling operations are applied to audit and further clean the data set by
automatic processing.
❖ Pattern analysis is implemented by a Management Information System (MIS).
❖ The information processed in the above steps is used to extract valuable and relevant
information for effective and timely decision making and trend analysis.
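
The pre-processing step above (stop-word removal, stemming and indexing) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the stop-word list, the suffix-stripping `crude_stem` function and the sample documents are all hypothetical, and a real system would use a full stop-word list and a proper stemmer.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip a few common English suffixes. This is NOT a full stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(docs):
    """Map each stem to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The clusters are formed in the mining step.",
    2: "Mining text documents is the goal.",
}
index = build_index(docs)
print(preprocess(docs[1]))  # ['cluster', 'form', 'min', 'step']
print(index["min"])         # [1, 2]
```

Note that the crude stemmer reduces "mining" to "min", which shows why production systems rely on more careful stemming or lemmatization.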

Extracting information from different documents is a tedious and difficult task. Selecting a
suitable text mining technique reduces the time and effort needed to find the relevant patterns for analysis and
decision making. The main goal of this paper is to analyze different text mining techniques that help
to perform text analytics effectively and efficiently on large amounts of data. The issues that arise
during the text mining process are also identified.
DIFFERENCE BETWEEN DATA MINING AND TEXT MINING:

| | DATA MINING | TEXT MINING |
|---|---|---|
| Overview | A range of functions to search for patterns and relationships in structured data | A range of functions to turn unstructured textual data into structured information to enable data analysis |
| Data type | Structured data from large datasets found in systems such as databases, spreadsheets, ERP, CRM and accounting applications | Unstructured textual data found in emails, documents, presentations, videos, file shares, social media and the Internet |
| Data retrieval | Structured data is homogeneous and organized, making it easy to retrieve | Unstructured textual data comes in many different formats and content types located in a more diverse range of applications and systems |
| Data preparation | Structured data is formal and formatted, facilitating the process of ingesting data into analytical models | Linguistic and statistical techniques (including NLP, keywording and meta-tagging) must be applied to turn unstructured data into usable structured data |
| Need for taxonomy | There is no need to create an over-riding taxonomy for data mining | As the unstructured text comes in many different forms and formats, there needs to be an over-riding taxonomy for the data so that it can be organized into a common framework |

TEXT MINING TECHNIQUES:


Natural language processing provides the technologies that teach machines how to analyze,
understand and generate text. Technologies such as information extraction,
summarization, categorization, clustering and information visualization are used in the text
mining process.
1. Information Extraction

Information extraction refers to the process of extracting meaningful
information from large volumes of textual data.
This text mining technique focuses on identifying and extracting entities,
attributes and their relationships from semi-structured or unstructured texts.
The extracted information is then stored in a database for future
access and retrieval.
The efficacy and relevance of the outcomes are evaluated using the
precision and recall measures.
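
As a rough sketch of how extraction and its evaluation fit together, the following Python example extracts two entity types with regular expressions and scores the result against a hand-labelled answer set. The patterns, the sample sentence and the gold annotations are hypothetical and chosen only for illustration; real extractors use far richer models.

```python
import re

# Hypothetical patterns for two entity types; real extractors are far richer.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract(text):
    """Return a set of (entity_type, value) pairs found in the text."""
    found = set()
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.add((label, match.group()))
    return found

def precision_recall(predicted, gold):
    """Precision = TP / predicted, recall = TP / gold (0.0 when undefined)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

text = "Ana can be reached at ana@example.org before 2024-05-01 or after 2024-06-15."
# Hand-labelled answers (an assumption for this example).
gold = {
    ("EMAIL", "ana@example.org"),
    ("DATE", "2024-05-01"),
    ("DATE", "2024-06-15"),
    ("PERSON", "Ana"),
}
predicted = extract(text)
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted))
print(precision, recall)  # 1.0 0.75
```

Everything the extractor returns is correct (precision 1.0), but it misses the person name because it has no pattern for it (recall 0.75).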

2. Information Retrieval
Information Retrieval (IR) refers to the process of extracting relevant and associated
patterns based on a specific set of words or phrases. In this technique, IR systems make use
of different algorithms to track and monitor user behaviors and discover relevant data
accordingly. The Google and Yahoo search engines are two well-known IR systems.
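
One common way to rank documents against a set of query words is TF-IDF weighting. The source does not name a specific algorithm, so the sketch below is an assumption for illustration only; the `rank` function and the sample documents are hypothetical, and commercial search engines use far more elaborate methods.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, docs):
    """Rank document ids by the summed TF-IDF weight of the query terms."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    "d1": "text mining extracts patterns from text",
    "d2": "data mining works on structured data",
    "d3": "search engines retrieve relevant documents",
}
print(rank("text mining", docs))  # ['d1', 'd2', 'd3']
```

The document that mentions both query words, and mentions the rarer word "text" twice, is ranked first.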

3. Categorization
Categorization is a form of “supervised” learning in which natural-language
texts are assigned to a predefined set of topics depending on their content. Categorization, which
relies on NLP (natural language processing), is a process of gathering text documents and processing and
analyzing them to uncover the right indexes for each document. The co-referencing method is commonly
used as a part of NLP to extract relevant synonyms and abbreviations from textual data. Today, NLP has
become an automated process used in a host of contexts, ranging from personalized advertisement delivery
to spam filtering, categorizing web pages under hierarchical definitions, and much more.
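
To make the idea of supervised categorization concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is one possible choice and is not named in the text above; the training messages and the labels "spam" and "ham" are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        # Prior: share of training documents in this category.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                count = word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled training data.
training = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached for review", "ham"),
]
model = train(training)
print(classify("claim your free prize", model))           # spam
print(classify("agenda for the project meeting", model))  # ham
```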

4. Clustering
Clustering is one of the most critical text mining techniques. It seeks to identify intrinsic
structures in textual information and organize them into relevant subgroups or ‘clusters’ for
further analysis. A significant challenge in the clustering process is to form meaningful
clusters from the unlabeled textual data without having any prior information on them.
Cluster analysis is a standard text mining tool that assists in data distribution or acts as a pre-
processing step for other text mining algorithms running on detected clusters.
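
The following sketch groups unlabeled documents without any prior information about them. It uses a greedy single-pass scheme based on Jaccard similarity between word sets, which is a deliberately simple stand-in and not k-means or any other algorithm named in the source; the threshold value and the sample documents are assumptions.

```python
import re

def tokens(text):
    """Return the set of lower-cased alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(docs, threshold=0.2):
    """Assign each document to the first cluster whose seed is similar enough."""
    clusters = []  # each entry: (seed token set, list of document indices)
    for i, text in enumerate(docs):
        doc_tokens = tokens(text)
        for seed, members in clusters:
            if jaccard(seed, doc_tokens) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append((doc_tokens, [i]))
    return [members for _, members in clusters]

docs = [
    "stock market prices fall",
    "football team wins the match",
    "market prices rise as stock rallies",
    "the team wins another football match",
]
print(cluster(docs))  # [[0, 2], [1, 3]]
```

The result depends on document order and on the threshold, which is one reason why forming meaningful clusters is considered a challenge.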

5. Summarisation
Text summarisation refers to the process of automatically generating a compressed
version of a specific text that holds valuable information for the end-user. The aim of this
technique is to browse through multiple text sources and produce summaries that retain a
considerable proportion of the information in a concise format, keeping the overall meaning and
intent of the original documents essentially the same. Text summarisation integrates and
combines various methods that are also employed in text categorization, such as decision trees, neural
networks, regression models, and swarm intelligence.
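
A very simple form of summarisation is extractive: pick the sentences whose words are most frequent in the whole text. The sketch below uses that word-frequency heuristic, which is an assumption made for illustration and is much simpler than the decision-tree or neural-network methods mentioned above; the stop-word list and the sample text are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "and", "in", "is", "it", "of", "the", "to"}

def content_words(text):
    """Lower-cased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # Sum of text-wide frequencies of the sentence's content words.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

text = (
    "Text mining extracts knowledge from text. "
    "The weather was pleasant. "
    "Mining large text collections reveals patterns."
)
print(summarize(text))  # Text mining extracts knowledge from text.
```

Because scores are plain sums, this heuristic favours longer sentences; normalising by sentence length is a common refinement.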

6. Natural Language Processing

Natural language processing is a challenging area within text mining that
draws on artificial intelligence. NLP is the study of human language so
that computers can understand natural languages in a way similar to humans. With NLP, a
computer system is able to resolve text mining ambiguities such as homonymy, polysemy,
synonymy and hyponymy. NLP also recognizes similar concepts, even if they have been
expressed in very different ways. For example, the same word may be spelt differently
(hemophilia/haemophilia, tumor/tumour), the same word may be realized differently in
different contexts (tumor/tumors, suffers/suffered), and the same concept may be expressed by
different words entirely (Tylenol / Acetaminophen, heart attack / myocardial infarction).
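
The three kinds of variation in the examples above can be illustrated with simple lookup tables that map each variant to one canonical form. This is only a sketch: the tables `SPELLING`, `LEMMAS` and `CONCEPTS` are hypothetical and hand-built from the examples in the text, whereas real NLP systems derive such mappings from lexicons, morphological analysers and ontologies.

```python
import re

# Hypothetical lookup tables built from the examples in the text above.
SPELLING = {"haemophilia": "hemophilia", "tumour": "tumor"}
LEMMAS = {"tumors": "tumor", "suffers": "suffer", "suffered": "suffer"}
CONCEPTS = {"tylenol": "acetaminophen", "heart attack": "myocardial infarction"}

def normalize(text):
    """Map spelling variants, inflections and synonyms to one canonical form."""
    text = text.lower()
    # Replace concept synonyms first, because some span several words.
    for variant, canonical in CONCEPTS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    words = re.findall(r"[a-z]+", text)
    words = [SPELLING.get(w, w) for w in words]
    words = [LEMMAS.get(w, w) for w in words]
    return " ".join(words)

print(normalize("Patient suffered a heart attack"))
# patient suffer a myocardial infarction
print(normalize("Tumour versus tumors"))
# tumor versus tumor
```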

HOW DOES TEXT MINING WORK:


Text mining helps to analyze large amounts of raw data and find relevant insights.
Combined with machine learning, it can create text analysis models that learn to extract or
classify specific information based on previously stored information.

Even though it may seem like a complicated matter, it can actually be quite simple to get
started with.

❖ The first step is gathering your data. Let’s say you want to analyze conversations
with users through your company’s Intercom live chat. The first thing you’ll need to do is
generate a document containing this data.

❖ Data can be internal (interactions through chats, emails, surveys, spreadsheets,
databases, etc.) or external (information from social media, review sites, news outlets,
and any other websites).

❖ The second step is preparing your data. Text mining systems use several NLP
techniques, such as tokenization, parsing, lemmatization, stemming and stop-word removal,
to build the inputs of your machine learning model.
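
Once the text has been tokenized, a common way to build the inputs of a machine learning model is a bag-of-words representation, where each document becomes a vector of word counts. The sketch below assumes that representation; the chat messages are hypothetical, and parsing, lemmatization and stop-word removal are left out to keep it short.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(docs):
    """Assign each distinct token a column index, in sorted order."""
    terms = sorted({t for doc in docs for t in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}

def vectorize(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    vector = [0] * len(vocabulary)
    for token in tokenize(text):
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

# Hypothetical live-chat messages.
chats = ["refund please", "please reset my password please"]
vocab = build_vocabulary(chats)
vectors = [vectorize(c, vocab) for c in chats]
print(vocab)    # {'my': 0, 'password': 1, 'please': 2, 'refund': 3, 'reset': 4}
print(vectors)  # [[0, 0, 1, 1, 0], [1, 1, 2, 0, 1]]
```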

EVALUATION:
It is possible to evaluate text extractors by using the same performance metrics as text
classification: accuracy, precision, recall and F1 score. However, these metrics only consider
exact matches as true positives, leaving partial matches aside.
Let’s look at an example:

❖ Suppose you create an address extractor. This could be an example of an exact match
(true positive for the tag Address): ‘6818 Eget St., Tacoma’. However, the output
could also be ‘6818 Eget St.’. In this case, even though it is only a partial match, it should
not be considered a false positive for the tag Address.

❖ To include these partial matches, you should use a performance metric known as
ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE is a family of
metrics that can evaluate the performance of text extractors better than
traditional metrics such as accuracy or F1. How do they work? They calculate the
lengths and number of sequences that overlap between the original text and the
extraction (extracted text).

❖ The ROUGE metrics (the parameters you would use to compare the overlap between
the two texts mentioned above) need to be defined manually. That way, you can
define ROUGE-n metrics (where n is the length of the units), or a ROUGE-L metric if
your intent is to compare the longest common subsequence.
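
The address example can be scored with a simplified, recall-only version of ROUGE-n and ROUGE-L. This is a sketch under stated assumptions: punctuation is stripped and text is lower-cased before comparison, only recall is computed, and the function names are hypothetical rather than taken from any ROUGE library.

```python
import re
from collections import Counter

def words(text):
    """Lower-cased word tokens with punctuation stripped (an assumption)."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(reference, candidate, n=1):
    """Share of reference n-grams that also appear in the candidate."""
    ref = Counter(ngrams(words(reference), n))
    cand = Counter(ngrams(words(candidate), n))
    overlap = sum((ref & cand).values())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_recall(reference, candidate):
    """LCS length divided by the number of reference tokens."""
    ref, cand = words(reference), words(candidate)
    return lcs_length(ref, cand) / len(ref) if ref else 0.0

reference = "6818 Eget St., Tacoma"  # the expected address
candidate = "6818 Eget St."          # the partial extraction
print(reference == candidate)                     # False: no exact match
print(rouge_n_recall(reference, candidate, n=1))  # 0.75
print(rouge_l_recall(reference, candidate))       # 0.75
```

An exact-match metric would score the partial extraction as a miss, whereas ROUGE-1 and ROUGE-L both credit it with three of the four reference tokens.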

CHALLENGES IN TEXT MINING


❖ Information is in unstructured textual form and in natural language (NL)
❖ It is not readily accessible to be used by computers
❖ Dealing with huge collections of documents
❖ Requires a skilled person to choose which documents to process and to analyse the
output
❖ Requires more time
❖ Cost: $50,000 for the software alone
❖ Large textual databases
❖ Almost all publications are also in electronic form
❖ Very high number of possible “dimensions” (but sparse): all possible word and phrase
types in the language
❖ Complex and subtle relationships between concepts in text
❖ Noisy data, for example spelling mistakes
❖ Word ambiguity and context sensitivity

APPLICATIONS:

❖ Text categorization into specific domains, for example spam versus non-spam e-mails, or
detecting sexually explicit content.

❖ Text clustering to automatically organize a set of documents. Let’s say you have a
folder of 200,000 documents in .pdf and would otherwise have to organize them by hand.

❖ Sentiment analysis to identify and extract subjective information in documents, for example to
detect what your customers are saying about your company when they use social
media.

❖ Concept/entity extraction that is capable of identifying people, places, organizations
and other entities from documents. The only limit here is the imagination.

❖ Document summarization to automatically provide the most important points in the
original document. This is particularly good for news summaries.

❖ Learning relations between named entities. An interesting paper here is: Identifying
Semantic Relations Between Named Entities from Chinese Texts.

❖ Another subtask in NLP is part-of-speech (POS) tagging. In this task
you try to assign a part of speech, such as noun, adjective or verb, to each word in a
text, based on context and relationship to adjacent words.

❖ Another important task in NLP is coreference resolution. It is about understanding
references to multiple entities existing in the text and disambiguating those references.
For example, the sentence “Obama told Joe Biden that he should consider running for
president” contains a coreference (Joe Biden and he). This task
is considered a stepping stone to more complex tasks such as question answering
and summarization.

❖ Detection of junk e-mails: unwanted or unsolicited material sent by
e-mail by an organization for advertising or promotional purposes is called junk
e-mail. Text mining techniques are applied to detect unwanted junk e-mails
automatically using classification techniques.
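
The sentiment analysis application listed above can be sketched with a small word lexicon. This is a minimal illustration, assuming a hypothetical hand-made lexicon and simple word counting; practical systems use much larger lexicons or trained classifiers and handle negation and context.

```python
import re

# Hypothetical mini-lexicon; real sentiment lexicons contain thousands of entries.
POSITIVE = {"good", "great", "love", "helpful"}
NEGATIVE = {"bad", "slow", "hate", "broken"}

def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love the helpful support team"))  # positive
print(sentiment("The app is slow and broken"))       # negative
print(sentiment("I opened a ticket"))                # neutral
```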

CONCLUSION:
Today a huge volume of digital data is available in the computing world, most of it
in textual form. Text mining techniques are applied to extract information from these
unstructured documents. This paper presents a brief overview of text mining, data mining
and their related terms. Text mining and data mining are both applied for information extraction
using a number of techniques. The major difference between the two processes lies in the
source of data to which the mining techniques are applied: data mining is applied to
structured data, while text mining is performed on unstructured or semi-structured data.
Text mining techniques include information extraction, information retrieval, document
classification and clustering. Natural language processing and machine learning algorithms
are also applied to mine text documents. Information, which is generated by extracting it
from data, plays a vital role in an organization’s success. The major text mining application areas are
fraud detection, spam detection, customer relationship management, and research and
development. Text mining still faces issues, such as the ambiguities present in text.
A great deal of research has been done, but text mining is still immature. Processing
natural language text is very difficult, and many research opportunities are available in this
area.
