SMS Text Classification
Course project
Inspiration for this project comes from a case study on U-Report [1], an SMS interface
developed by UNICEF Uganda to collect information and complaints from locals and
direct them quickly and efficiently to the relevant departments. At the time the service
was developed, internet infrastructure was very poor and most of the population used
basic cellphones rather than smartphones. So essentially, this was a way to give the
public a forum to share information and submit their grievances.
The problem we are trying to solve is SMS text classification. Around 200,000
messages are received on the portal every day. The objective of the program in Uganda
is to understand the data in real time and have issues addressed by the appropriate
departments in UNICEF in a timely manner. Given the high volume and velocity of the
data streams, manual inspection of all messages is no longer sustainable [1].
The problem and solution were floated in the early 2010s (2012-2014), and much
development in text analysis has happened since then. By today's standards, I believe
the method I illustrate below would have had a good chance of solving this problem.
Solving this problem can not only provide immediate relief in relevant areas, but also
help identify the major problems in each geographic location, which can help UNICEF
run campaigns to address those issues.
Data collection
Since this is a text classification problem, we will be collecting the following data:
1. Text data
2. Text location (geo-tags or cell tower information)
Apart from this data, since we will be using supervised learning algorithms in
conjunction with other models, we need a label for each data point we collect. One way
to obtain labels is to use the data that accumulated before the solution was implemented
and label each text with the department it was forwarded to. We could also employ
human labelling, but that depends on the time constraints and budget of the project. The
end goal of the data collection process is a labelled data set, and the most economical
approach (in terms of time and labor) is the first one.
Data preprocessing
Stop word removal
We will also need to preprocess this data before we feed it to our algorithm. A few
things about the text data should be kept in mind. It will largely be unorganized and
will contain words that carry little meaning of their own but make the text
grammatically readable, words like ‘the’, ‘on’, ‘a’, etc. Such words are called stop
words. A statement like ‘The pizza place on the Times-Square is simply the best.’ is
readable, but the context is given by just three of its ten words – ‘pizza’,
‘Times-Square’, and ‘best’. So, we need to remove such stop words from the actual text.
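As a minimal sketch of this step, assuming NLTK's English stop-word list is acceptable for our messages (the tokenizing regex and the example sentence are illustrative choices):

```python
# Sketch: stop-word removal with NLTK's English stop-word list.
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)        # needed once per machine
STOP_WORDS = set(stopwords.words("english"))

def remove_stop_words(text):
    """Lower-case, tokenize on word characters (keeping hyphenated words) and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+(?:-[a-z0-9']+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The pizza place on the Times-Square is simply the best."))
# Roughly ['pizza', 'place', 'times-square', 'simply', 'best'];
# the exact output depends on the tokenizer and the stop-word list in use.
```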
Stemming and lemmatization
The second step in our preprocessing is called stemming. In this process we attempt
to reduce the words in the text to their root form. For example, in our pizza example,
‘simply’ is reduced to ‘simple’ (or ‘simpl’, depending on how the algorithm is
implemented). Next, we perform lemmatization, where we reduce words, including
superlatives, to their dictionary base form, e.g., ‘best’ becomes ‘good’.
As an additional step we can perform spell checking and abbreviation expansion as well.
These steps are beneficial when we are dealing with a vocabulary that is highly prone to
being linguistically altered, such as when there is a character limit; a common example
is using ‘b4’ instead of ‘before’. SMS messages have character limits, so this would be
a good place to apply these methods as well. There are libraries in Python which help
in performing the preprocessing steps (NLTK and sklearn).
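A sketch of these steps with NLTK is below; the abbreviation dictionary is a hand-made illustration rather than a library feature, and the exact outputs depend on the stemmer and lemmatizer chosen:

```python
# Sketch: stemming, lemmatization and simple abbreviation expansion with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)     # lemmatizer dictionary, needed once
nltk.download("omw-1.4", quiet=True)

ABBREVIATIONS = {"b4": "before", "u": "you", "pls": "please"}   # illustrative, hand-made map

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalise(tokens):
    """Expand known abbreviations, then reduce each token to a root form."""
    expanded = [ABBREVIATIONS.get(t, t) for t in tokens]
    stems = [stemmer.stem(t) for t in expanded]        # e.g. 'complaints' -> 'complaint'
    lemmas = [lemmatizer.lemmatize(t) for t in expanded]  # e.g. 'children' -> 'child'
    return stems, lemmas

print(normalise(["b4", "simply", "complaints", "children"]))
```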
Classification
Clustering
Given:
• Text after preprocessing
Use:
• TF-IDF
• K-Means Clustering
To:
• Calculate a score for each word, converting the linguistic data into a numerical
form
• Cluster similar words together (a sketch of both steps follows)
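One possible sketch of this step, assuming scikit-learn and representing each term by its TF-IDF weights across all messages (the transposed document-term matrix); the sample messages and the choice of three clusters are illustrative assumptions:

```python
# Sketch: score terms with TF-IDF and cluster them with K-Means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

messages = [
    "no clean water in our village",
    "water pump broken for two weeks",
    "children made to work in the fields",
    "child labour at the local market",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(messages)            # documents x terms
terms = vectorizer.get_feature_names_out()

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)  # k is an assumption
word_clusters = kmeans.fit_predict(doc_term.T.toarray())   # terms x documents

for term, cluster in zip(terms, word_clusters):
    print(cluster, term)
```

Clustering word vectors this way is deliberately simple; as the reality check section notes, grouping words by semantics usually needs more careful tuning than plain K-Means.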
Random Forest Classifier
Once our corpus (vocabulary) is clustered, for each text we count the words it contains
from each cluster. We thus have new extracted features that represent how strongly each
cluster is represented in the received text. We use this data to train a random forest
classifier. The labels are provided with our data, so this becomes a supervised learning
approach in which each text is assigned to a category.
Given:
• Word count from each cluster for each text
• Label for each text
Use:
• Random forest Classifier
To:
• Classify each text into the relevant categories required for administration (see the sketch below)
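A sketch of this step, reusing the `terms`, `word_clusters`, `kmeans` and `messages` objects from the clustering sketch above; the labels shown are assumed to come from the departments historical messages were routed to:

```python
# Sketch: turn each message into per-cluster word counts and train a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

term_to_cluster = dict(zip(terms, word_clusters))
n_clusters = kmeans.n_clusters

def cluster_counts(tokens):
    """Count how many tokens of one message fall in each word cluster."""
    counts = np.zeros(n_clusters)
    for t in tokens:
        if t in term_to_cluster:
            counts[term_to_cluster[t]] += 1
    return counts

X = np.array([cluster_counts(m.split()) for m in messages])
y = ["water", "water", "child_labour", "child_labour"]     # assumed department labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.predict(cluster_counts("water is dirty in the village".split()).reshape(1, -1)))
```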
Statistical frequency calculation
With the data classified into categories for administration, we can go one step further:
by grouping the texts by their geographic origin and using the predicted labels, we can
identify which issues are prevalent in each region and help UNICEF launch targeted
campaigns.
Given:
• Labelled text messages from Random Forest Classifier
• Location of origin of each text
Use:
• Statistical frequency calculation
To:
• Identify which issues are impacting any given region the most (see the sketch below)
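A minimal sketch with pandas, where the region names and labels are purely illustrative:

```python
# Sketch: count predicted issue labels per region to find the most pressing issue.
import pandas as pd

df = pd.DataFrame({
    "region": ["Gulu", "Gulu", "Kampala", "Kampala", "Kampala"],
    "label":  ["water", "water", "child_labour", "water", "child_labour"],
})

issue_frequency = df.groupby(["region", "label"]).size().unstack(fill_value=0)
print(issue_frequency)
# The largest count in each row points to the most pressing issue for that region.
print(issue_frequency.idxmax(axis=1))
```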
Clustering is a heuristic approach to the problem: it is quick and efficient but not
guaranteed to be optimal. The use case for the solution is to analyze data in real time and
forward it to the relevant departments, so the solution leans more towards speed than
accuracy. The alternative, as discussed in a later section, is to use transformer models or
other neural networks. These often provide better results but require a lot of
hyperparameter optimization and build time (depending on the number of hidden layers
and neurons).
Why random forest?
The TF-IDF of a term tells us about its relative significance in a document. The intuition is
that a term is more relevant to a document if it occurs often in that document (term frequency)
and rarely in others (inverse document frequency). For example, the term ‘water’ occurs often
in a text containing complaints about water and relatively rarely across all the documents put
together. The term ‘the’, however, will occur at least once in almost every text. Its TF will be
high (but that does not mean the text is about ‘the’; it is about water), while its IDF will be
very low, bringing its overall score down. Words with high TF-IDF scores are essentially the
cluster centers, or topics, we want to form clusters around.
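A hand-worked illustration of that intuition, using the textbook formula tf-idf = tf × log(N/df); the counts are assumed, and scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers differ slightly:

```python
# Sketch: the 'water' vs 'the' intuition with assumed counts.
import math

N = 1000                        # assumed: total number of messages
tf_water, df_water = 5, 50      # 'water' appears often here, rarely elsewhere
tf_the,   df_the   = 8, 1000    # 'the' appears in essentially every message

tfidf_water = tf_water * math.log(N / df_water)   # 5 * log(20) ~ 15.0
tfidf_the   = tf_the   * math.log(N / df_the)     # 8 * log(1)  =  0.0
print(tfidf_water, tfidf_the)
```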
Reality check
Clustering words with similar semantics might require manual intervention and tuning; i.e.,
using K-Means as a linguistic clustering mechanism is not going to be the most straightforward
approach. There are several preprocessing steps which require careful tweaking of algorithms
to suit our use case. A more feasible approach would be to use off-the-shelf algorithms from
some of the latest libraries and linguistic research. Some of these algorithms employ a
modified K-Means under the hood to meet the same goal.
Even with a proper clustering algorithm in place, we might only reach a certain level of
performance unless we deploy something more sophisticated. Since most clustering methods
are heuristics, they are not guaranteed to give the best results. Anything in the 70-80% range
in such cases can be considered peak performance (and that is highly optimistic). There are,
however, techniques that have proven much more effective (discussed below).
There may also be other factors involved, such as non-English linguistic patterns and words.
We are not using a language-based model, and spelling mistakes might throw model training off.
Alternate approaches
Neural networks have proven very effective in recent research on linguistic patterns and
natural language processing, because they learn to identify patterns in text that are hard to
define with explicit logic. Architectures like Long Short-Term Memory (LSTM) networks have
been the industry standard for quite some time, and the addition of attention mechanisms has
widened their application in NLP.
Built on attention mechanisms, transformer models have in recent years (2017 onwards)
emerged as the de facto models for solving linguistic problems. These models, however, are
trained on a huge corpus and thus have large memory and computation requirements. A smaller
variant fine-tuned for a specific task can be deployed instead, but that requires a deeper
understanding of the original model, and the results are also harder to explain to
stakeholders, as is the case for the majority of neural network models.
Support Vector Classifiers (SVCs) are a good alternative to the random forest classifier
proposed earlier, and a soft-margin classifier would be a sensible choice given the possible
overlap between certain categories of text. The reason to prefer random forest over SVC is
simply the ease of optimization. A random forest is not just one tree but a collection of trees
whose results are aggregated, which helps avoid overfitting. An SVC, on the other hand, gives a
single classifier whose results are easier to explain, but optimization and prevention of
overfitting are an added step. There is no clear winner between the two for the current
scenario; it comes down to preference (there may be performance differences as well, but we
will not know unless both are tested), so it would be better to test both approaches before
deployment, as in the sketch below.
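A sketch of such a comparison, assuming `X` and `y` are the full labelled cluster-count feature matrix and labels built earlier; the linear kernel and cross-validation setting are arbitrary choices:

```python
# Sketch: compare the random forest and an SVC on the same features via cross-validation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

candidates = [
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svc", SVC(kernel="linear", C=1.0)),
]

for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=3)   # cv=3 is arbitrary
    print(name, scores.mean())
```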
Summary
Data Preprocessing
SMS classification
1. We cluster the words which seem similar. Each cluster corresponds to a different
topic such as water, child labor, crime, food, etc.
2. For each text we count how many words from each cluster it contains.
3. We use the extracted features to train a random forest classifier. The features
are the per-cluster word counts identified in step 2.
4. We then group the outcomes by geographic location to identify which
campaigns to run.
Works Cited