Twitter Sentiment Analysis Dss
1 Introduction
1. Input: an entity-level Twitter sentiment analysis dataset from Kaggle.
2. Main task: judge the sentiment of a message about a given entity. The dataset description
lists three classes (Positive, Negative and Neutral) and regards messages that are not relevant
to the entity (i.e. Irrelevant) as Neutral. In this notebook Irrelevant is kept as a separate
label, so the model predicts four classes. The goal is a model that automatically identifies
the emotional states (e.g. anger, joy) that people express about a company’s product on
Twitter.
3. Output: the predicted sentiment: Positive, Negative, Neutral or Irrelevant.
4. Method: text classification. This is one of the most common tasks in NLP. It can be used for
a wide range of applications (e.g. tagging customer feedback into categories, routing support
tickets according to their language). Another common type of text classification problem is
sentiment analysis, which aims to identify the polarity of a given text (+/-).
5. Processing requirement:
• Data Loading and Cleaning: Load the twitter_training dataset and clean it to handle missing
values and inconsistencies.
• Exploratory Data Analysis (EDA): Use visualizations to understand data distributions and
key patterns.
• Feature Engineering: Create new features or transform existing ones to improve predictive
modeling.
• Model Building and Evaluation: Train machine learning models to predict the sentiment:
Positive, Negative, Neutral or Irrelevant.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
columns = ['ID','Entity','Sentiment','Tweet_content']
df = pd.read_csv("/kaggle/input/twitter-entity-sentiment-analysis/twitter_training.csv", names=columns)
(74682, 4)
Tweet_content
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …
[3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 74682 non-null int64
1 Entity 74682 non-null object
2 Sentiment 74682 non-null object
3 Tweet_content 73996 non-null object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB
[4]: Sentiment
Negative 22542
Positive 20832
Neutral 18318
Irrelevant 12990
Name: count, dtype: int64
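As a quick check on class balance, the counts above can be turned into shares of the 74,682 rows. This is a minimal sketch in plain Python that reuses only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"Negative": 22542, "Positive": 20832, "Neutral": 18318, "Irrelevant": 12990}

total = sum(counts.values())  # 74682, the number of rows in df

# Share of each class as a percentage of all rows
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label:<11}{pct:5.1f}%")
```

The largest class (Negative, about 30%) is less than twice the size of the smallest (Irrelevant, about 17%), so the classes are only mildly imbalanced.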
4 Preprocessing
4.0.1 Dealing with missing values
ID 0
Entity 0
Sentiment 0
Tweet_content 686
dtype: int64
There are 686 samples with no text. As the text is crucial for this task, we remove these samples
before both EDA and model fitting.
[7]: # remove missing values
df.dropna(inplace=True)
# check missing values
df.isnull().sum()
[7]: ID 0
Entity 0
Sentiment 0
Tweet_content 0
dtype: int64
[8]: 2340
[9]: 0
[10]: df.dropna(inplace=True)
5 EDA
[11]: # Calculate class counts
class_counts = df['Sentiment'].value_counts().reset_index()
class_counts.columns = ['Class', 'Count']
# Calculate the percentage for each class based on the total number of tweets
total_tweets = class_counts['Count'].sum()
class_counts['Percentage'] = (class_counts['Count'] / total_tweets) * 100
plt.pie(class_counts['Percentage'], labels=class_counts['Class'], autopct='%1.1f%%', startangle=140)
[12]: data=df.groupby(by=["Entity","Sentiment"]).count().reset_index()
data.head()
sns.barplot(data=data,x="Entity",y="ID",hue='Sentiment')
plt.xticks(rotation=90)
plt.xlabel("Entity")
plt.ylabel("Number of tweets")
plt.grid()
plt.title("Distribution of tweets per Entity")
plt.show()
[15]: # Convert entities to a single string
branches_text = ' '.join(count_table.index)
[16]: # Concatenate all tweets into a single string
all_tweets_text = ' '.join(df['Tweet_content'])
6 Preprocess Function for Model
[17]: # load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")
[18]: # use this utility function to get the preprocessed text data
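def preprocess(text):
    # remove stop words and punctuation, then lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    # join the lemmas back into a single string
    return " ".join(filtered_tokens)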
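To illustrate what the filtering step does without loading a spaCy model, here is a rough, dependency-free sketch. It is not equivalent to the function above: the stop-word set `TOY_STOP_WORDS` is a small hypothetical list, tokens are found with a simple regular expression, and no lemmatization is performed.

```python
import re

# Hypothetical, deliberately tiny stop-word list (spaCy's real list is much larger)
TOY_STOP_WORDS = {"i", "am", "and", "on", "the", "to", "will", "you"}

def toy_preprocess(text):
    """Lowercase, keep alphanumeric tokens, and drop stop words (no lemmatization)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # punctuation is discarded here
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

print(toy_preprocess("I am coming to the borders and I will kill you"))
# coming borders kill
```

With spaCy's lemmatizer the corresponding row becomes `come border kill`, as the Preprocessed Text column shows.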
[20]: df
2 2401 Borderlands Positive
3 2401 Borderlands Positive
4 2401 Borderlands Positive
… … … …
74677 9200 Nvidia Positive
74678 9200 Nvidia Positive
74679 9200 Nvidia Positive
74680 9200 Nvidia Positive
74681 9200 Nvidia Positive
Tweet_content \
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …
… …
74677 Just realized that the Windows partition of my…
74678 Just realized that my Mac window partition is …
74679 Just realized the windows partition of my Mac …
74680 Just realized between the windows partition of…
74681 Just like the windows partition of my Mac is l…
Preprocessed Text
0 m get borderland murder
1 come border kill
2 m get borderland kill
3 m come borderland murder
4 m get borderland 2 murder
… …
74677 realize Windows partition Mac like 6 year Nvid…
74678 realize Mac window partition 6 year Nvidia dri…
74679 realize window partition Mac 6 year Nvidia dri…
74680 realize window partition Mac like 6 year Nvidi…
74681 like window partition Mac like 6 year driver i…
[22]: df.head(5)
1 2401 Borderlands 3
2 2401 Borderlands 3
3 2401 Borderlands 3
4 2401 Borderlands 3
Tweet_content \
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …
Preprocessed Text
0 m get borderland murder
1 come border kill
2 m get borderland kill
3 m come borderland murder
4 m get borderland 2 murder
test_size=0.2, random_state=42, stratify=df['Sentiment'])
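The `stratify` argument keeps the class proportions the same in the training and test sets. The idea can be sketched in plain Python as follows; `stratified_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and its rounding of per-class test sizes is a simplification.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_size of each class in the test set."""
    # Group row indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 10 Negative, 5 Positive
labels = ["Negative"] * 10 + ["Positive"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

On the toy labels, the test set receives 2 of the 10 Negative rows and 1 of the 5 Positive rows, so both splits keep the 2:1 class ratio.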
7.0.1 Naive Bayes Model
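from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Note: despite the step name, TfidfVectorizer() defaults to unigrams (ngram_range=(1, 1))
clf = Pipeline([
    ('vectorizer_tri_grams', TfidfVectorizer()),
    ('naive_bayes', MultinomialNB())
])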
0.7229277142059727
[31]: Pipeline(steps=[('vectorizer_tri_grams', TfidfVectorizer()),
('naive_bayes', RandomForestClassifier())])
0.9108289143176109
Based on the results shown, both models perform reasonably well. The Random Forest classifier is
clearly the stronger of the two, scoring about 0.91 against about 0.72 for Naive Bayes, and its
accuracy, precision, recall and F1-score are all high.
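For reference, the classification metrics discussed above can be computed per class as in the sketch below. The labels are invented toy data, not the notebook's predictions, and `per_class_report` is a hypothetical helper. The fourth value it returns, support, is simply the number of true samples of a class, so it describes the data rather than the quality of the model.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    report = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, tp + fn)
    return report

# Invented toy labels
y_true = ["Positive", "Positive", "Negative", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Negative", "Neutral"]

# Accuracy is the fraction of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
for label, scores in per_class_report(y_true, y_pred).items():
    print(label, scores)
```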
8 Test Model
Get text
[35]: test_df = pd.read_csv('/kaggle/input/twitter-entity-sentiment-analysis/twitter_validation.csv', names=columns)
test_df.head()
Tweet_content
0 I mentioned on Facebook that I was struggling …
1 BBC News - Amazon boss Jeff Bezos rejects clai…
2 @Microsoft Why do I pay for WORD when it funct…
3 CSGO matchmaking is so full of closet hacking,…
4 Now the President is slapping Americans in the…
The professional dota 2 scene is fucking exploding and I completely welcome it.
[37]: ['professional dota 2 scene fucking explode completely welcome \n\n garbage']
Get Prediction
[38]: test_text = clf.predict(test_text_processed)
Output
[39]: classes = ['Irrelevant', 'Neutral', 'Negative', 'Positive']
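The index-to-name list only decodes predictions correctly if its order matches the integer encoding used during training, which is not shown here. As a hedged sketch: assuming the labels were encoded in sorted order (which is how scikit-learn's `LabelEncoder` assigns integers), the mapping can be derived from the label names instead of typed by hand. The `predicted` values below are invented.

```python
# Label names as they appear in the Sentiment column
label_names = ["Positive", "Negative", "Neutral", "Irrelevant"]

# Assumption: integer codes follow the sorted order of the label names
classes = sorted(label_names)
print(classes)  # ['Irrelevant', 'Negative', 'Neutral', 'Positive']

# Invented integer predictions, decoded back to names
predicted = [3, 1, 2]
print([classes[i] for i in predicted])  # ['Positive', 'Negative', 'Neutral']
```

Under that assumption, index 1 is Negative and index 2 is Neutral, so a hand-written list is worth checking against the encoder that was actually used.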