Ask Analytics - Text Mining in R - Part 3

Uploaded by

norelkys
Copyright
© © All Rights Reserved

27/2/2021 Ask Analytics: Text Mining in R - Part 3

Text Mining in R - Part 3
Comparison and Commonality Cloud and much more

In the previous articles of the series, we covered web scraping and the basics of text mining. We have also covered the basic word cloud.

Now it is time to learn some very useful text functions and web scraping variants. We will also learn how to create comparison and commonality type word clouds, and how to analyse them.
With reference to the first article of the series: Text Mining in R - Part 1

We learned earlier how to extract tweets on a particular hashtag. What if ...

Q. Can we scrape tweets based on two hashtags occurring together?

Ans. Yes, it is very much possible; use:

xyz = searchTwitter("#MannKiBaat AND #NAMO", n=50)
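
If there are more than two hashtags, the search string can be put together programmatically. This is only a minimal sketch, assuming the same AND syntax as in the call above; build_hashtag_query is a hypothetical helper name of ours and not part of twitteR.

```r
# Hypothetical helper: joins hashtags into one search string with AND,
# adding a leading "#" wherever it is missing
build_hashtag_query = function(tags)
{
  tags = ifelse(grepl("^#", tags), tags, paste0("#", tags))
  paste(tags, collapse = " AND ")
}

query = build_hashtag_query(c("MannKiBaat", "#NAMO"))
print(query)

# Once the connection between R and Twitter is set up, the string can be used as:
# xyz = searchTwitter(query, n = 50)
```

The searchTwitter call is left commented out because it needs valid credentials.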

Q. Can we scrape tweets from a specific user's timeline, instead of on a hashtag basis?

Ans. Well, yes. An example is within the article.

Hashtag based scraping is done when we want to know the opinion of people about
a certain topic; timeline based scraping is done to know what a person/institution is
up to. We should learn both.

I was planning to buy a new mobile connection and I was supposed to choose one from Airtel and
Vodafone. I thought I should analyze these companies' behavior on Twitter and then see whether it is
helpful in decision making. In this exercise, I got to learn two things: Comparison Cloud and
Commonality Cloud.

#---------Let's first make the connection between R and Twitter ---------------------#

Consumer_key = "6NY7fDv___________QDT6WtrDK2p"
Consumer_secret = "6R06rlKb5LEy3yIb_____________HChZCBzXvgXXHV8V6oZC"
access_token = "3154348417-u0al6vBfU___________YFQwjQJIjQHeMErdJVI"
access_token_secret = "0fZ5WxRDNfH________________tsqAtIkhAC0NQQaSVWx"

# I have masked my credentials; you need to get your own. (If you don't know where to get them
# from, I believe you have missed the first blog of the Text Mining series.)

if(!require(twitteR)) install.packages("twitteR")
library(twitteR)
setup_twitter_oauth(Consumer_key,Consumer_secret,access_token,access_token_secret)
# remove all objects, including the credentials, from the workspace
rm(list = ls())

#------------------CONNECTION DONE------------------------------------#
www.askanalytics.in/2016/05/text-mining-in-r-part-3.html 1/5
# We would now fetch tweets from Airtel and Vodafone India timeline

#Twitter name for Airtel India : airtelindia
#Twitter name for Vodafone India  : VodafoneIN

airtel_tweets = userTimeline("airtelindia", n=500, since = "2016-01-01")
vodafone_tweets = userTimeline("VodafoneIN", n=500, since = "2016-01-01")

# we now get the text part of the tweets from both the extracts

airtel_tweets = sapply(airtel_tweets, function(x) x$getText())
vodafone_tweets = sapply(vodafone_tweets, function(x) x$getText())


# We now need to clean the extracted text, we will perform cleaning in two phases

# Cleaning Phase 1 -- Twitter specific cleaning -- For this we define a function
# gsub is a very useful function, do learn about it. We would write about it soon.

clean.twitter = function(x)
{
  # remove @ taggings
  x = gsub("@\\w+", "", x)
  # remove punctuations
  x = gsub("[[:punct:]]", "", x)
  # remove links which start with http
  x = gsub("http\\w+", "", x)
  # collapse runs of spaces and tabs into a single space, so that words do not get glued together
  x = gsub("[ \t]{2,}", " ", x)
  # remove blank spaces at the beginning
  x = gsub("^ ", "", x)
  # remove blank spaces at the end
  x = gsub(" $", "", x)
  # remove non-ASCII characters
  x = gsub("[^\x20-\x7E]", "", x)
  return(x)
}

# Now we shall use the function defined above
airtel = clean.twitter(airtel_tweets)
vodafone = clean.twitter(vodafone_tweets)
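
To see what this kind of Twitter specific cleaning does, it helps to run the gsub calls on a single made-up tweet. The snippet below is just a sketch: the sample tweet is invented, strip_tweet is a hypothetical trimmed-down cleaner written for this illustration, and it squeezes repeated spaces into one space so that neighbouring words stay separated.

```r
# Hypothetical, trimmed-down cleaner for illustration only
strip_tweet = function(x)
{
  x = gsub("@\\w+", "", x)          # drop @ taggings
  x = gsub("[[:punct:]]", "", x)    # drop punctuation (http://t.co/abc becomes httptcoabc)
  x = gsub("http\\w+", "", x)       # drop what is left of the links
  x = gsub("[ \t]{2,}", " ", x)     # squeeze repeated spaces and tabs into one space
  x = gsub("^ | $", "", x)          # drop one leading and one trailing space
  return(x)
}

# Invented sample tweet
sample_tweet = "@airtelindia My 4G is slow!! See http://t.co/abc123 #fail"
print(strip_tweet(sample_tweet))
```

Note that the hashtag sign goes away with the punctuation, so the word "fail" itself stays in the text.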
Follow by Email
# Let us now make the consolidated vectors with all the tweets related to one entity together

airtel_1 = paste(airtel, collapse=" ")
vodafone_1 = paste(vodafone, collapse=" ")

# and now put both into a single vector

The_one = c(airtel_1, vodafone_1)

# Cleaning Phase 2 --  Generic cleaning, which is done by using tm package functions

if(!require(tm)) install.packages("tm")
library(tm)
corpus = Corpus(VectorSource(The_one ))

textCorpus = tm_map(corpus, content_transformer(tolower))


textCorpus = tm_map(textCorpus, removeWords, stopwords("english"))
textCorpus = tm_map(textCorpus, removeNumbers)
textCorpus = tm_map(textCorpus, stripWhitespace)
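
For intuition about what these generic cleaning steps do to a piece of text, here is a rough base R imitation on one invented sentence. It is a sketch only: the two-word stopword list is made up for the example (the real English list is much longer), and this is not how tm implements its transformations.

```r
# Invented sentence and a tiny, made-up stopword list
text = "The NEW plan  offers 4 GB for the first 30 days"
stop_list = c("the", "for")

# lower case
text = tolower(text)
# remove stopwords as whole words
stop_pattern = paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")
text = gsub(stop_pattern, "", text, perl = TRUE)
# remove numbers
text = gsub("[0-9]+", "", text)
# collapse repeated whitespace and trim both ends
text = gsub("\\s+", " ", text)
text = trimws(text)

print(text)
```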

# Post cleaning, it is time to create the Term Document Matrix. Two such matrices
can be made using the tm package:

1. Document Term Matrix (DTM): a matrix that describes the frequency of the terms
that occur in each document. In a document-term matrix, rows correspond to
documents in the collection and columns correspond to terms.

2. Term Document Matrix (TDM): similar to the DTM, but its transpose. In a term-document
matrix, rows correspond to terms in the collection and columns correspond to documents.
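
The relation between the two matrices is easy to check on a toy example. The sketch below builds both by hand in base R from two invented one-line "documents"; it only illustrates the layout and is not the tm implementation.

```r
# Two toy "documents" (invented for illustration)
docs = c(Airtel = "new plan new offer", Vodafone = "sorry plan issue")

# Split each document into words
words = strsplit(docs, " ")

# All distinct terms, sorted alphabetically
terms = sort(unique(unlist(words)))

# Term Document Matrix: one row per term, one column per document
toy_tdm = sapply(words, function(w) sapply(terms, function(term) sum(w == term)))

# Document Term Matrix: simply the transpose
toy_dtm = t(toy_tdm)

print(toy_tdm)
print(toy_dtm)
```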

# Back to code

tdm = as.matrix(TermDocumentMatrix(textCorpus))
head(tdm)

# Each row is a term and each column is a document; we now name
# columns 1 and 2 after their respective entities

colnames(tdm) = c("Airtel", "Vodafone")

Now we shall make two types of word cloud (the basic type we have already learnt previously):

Comparison Cloud : Used to check the contrast between two text corpora
Commonality Cloud : Used to check the common terms across various corpora
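
To make the idea of "contrast" concrete, here is a small base R sketch of how a term can be scored for a comparison cloud. As far as we understand the wordcloud documentation, each term is sized by how far its rate in one document deviates from its average rate across documents; the toy counts below are invented and the code is our own illustration, not the package source.

```r
# Toy term counts (invented): rows are terms, columns are documents
toy = matrix(c(8, 1,
               1, 7,
               1, 2),
             ncol = 2, byrow = TRUE,
             dimnames = list(c("offer", "sorry", "plan"),
                             c("Airtel", "Vodafone")))

# Rate of each term within each document (every column sums to 1)
rates = sweep(toy, 2, colSums(toy), "/")

# Deviation of each rate from the term's average rate across documents
deviation = rates - rowMeans(rates)

# Each term goes to the document where it deviates the most,
# and that maximum deviation drives its size
assigned = colnames(toy)[apply(deviation, 1, which.max)]
size = apply(deviation, 1, max)

print(data.frame(term = rownames(toy), assigned, size = round(size, 3)))
```

A term like "plan", which both handles use at a similar rate, gets a small score and hardly shows up in such a cloud.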

# Let's make a comparison cloud

if(!require(wordcloud)) install.packages("wordcloud")
library(wordcloud)
# comparison cloud
# (min.freq is not an argument of comparison.cloud, so it is left out; max.words limits the number of words)
comparison.cloud(tdm, random.order=FALSE,
                 colors = c("#00B2FF", "red"),
                 title.size=1.5, max.words=500)

We can see that the Airtel Twitter handle is mostly talking about its products, plans, features or events; Vodafone, on the other hand, is mainly replying to unsatisfied customers. Especially this guy Amit is writing most of their tweets.

Thoughts that came to my mind:

EITHER Airtel has got fewer complaints, while Vodafone has got too many of those, OR Vodafone is more focused on customer satisfaction and hence is using its Twitter handle to reply to customers' complaints, unlike Airtel, which is using it for advertising its products.

One thing is sure: if I take a Vodafone connection, I would need to talk to this Amit one day.

# Let's now make a commonality cloud

# (title.size is a comparison.cloud argument; a commonality cloud has no titles, so it is left out)
commonality.cloud(tdm, random.order=FALSE,
                  colors = brewer.pal(8, "Dark2"))

Commonality Cloud gives an idea about what common terms two (or more) entities are using. In this
case, there is nothing much that can be interpreted.
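
A quick way to picture "common terms" is to look at each term's smallest share across the documents: a term that is missing from any one document scores zero. The base R sketch below does this on invented counts. Our understanding is that the package uses the minimum as its default commonality measure, but treat that as an assumption; this is an illustration and not the package code.

```r
# Toy term counts (invented): rows are terms, columns are documents
toy = matrix(c(5, 3,
               4, 0,
               0, 6),
             ncol = 2, byrow = TRUE,
             dimnames = list(c("plan", "offer", "sorry"),
                             c("Airtel", "Vodafone")))

# Share of each term within each document
rates = sweep(toy, 2, colSums(toy), "/")

# Common weight of a term: its smallest share across the documents
common = apply(rates, 1, min)

# Keep only the terms that occur in every document
common = common[common > 0]
print(common)
```

Only "plan" survives here, because "offer" and "sorry" are each used by one handle only.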


What's next in the series:

We are going to cover a few more functions of the tm package, text association, sentiment analysis and much more, till then ...

Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' section and share as much as possible.

A humble appeal: Please do like us @ Facebook

Posted by Unknown


Copyright 2015: Ask Analytics. Simple theme. Powered by Blogger.
