
twitter-sentiment-analysis-dss

June 27, 2024

1 Introduction
1. Input: an entity-level Twitter sentiment analysis dataset from Kaggle.
2. Main task: judge the sentiment of each message towards the entity it mentions. There are
three classes in this dataset: Positive, Negative and Neutral. We regard messages that are
not relevant to the entity (i.e. Irrelevant) as Neutral, then build a model that can
automatically identify emotional states (e.g. anger, joy) that people express about a
company's product on Twitter.
3. Output: the predicted sentiment of each message: Positive, Negative, Neutral or Irrelevant.
4. Method: text classification. This is one of the most common tasks in NLP. It can be used
for a wide range of applications (e.g. tagging customer feedback into categories, routing
support tickets according to their language). Another common text classification problem is
sentiment analysis, which aims to identify the polarity of a given text (positive/negative).
5. Processing requirements:
• Data Loading and Cleaning: Load the twitter_training dataset and perform data cleaning to
handle missing values and inconsistencies.
• Exploratory Data Analysis (EDA): Use visualizations to understand data distributions and
key patterns.
• Feature Engineering: Create new features or transform existing ones to improve predictive
modeling.
• Model Building and Evaluation: Train machine learning models to predict the sentiment
classes: Positive, Negative, Neutral and Irrelevant.

2 Import Needed Modules


[1]: import pandas as pd
import numpy as np
import re
import string
import nltk
import joblib
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from PIL import Image

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from sklearn.model_selection import train_test_split


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

3 Loading the data


[2]: # Read the dataset "twitter_training.csv" and store it in a variable df

columns = ['ID', 'Entity', 'Sentiment', 'Tweet_content']
df = pd.read_csv("/kaggle/input/twitter-entity-sentiment-analysis/twitter_training.csv",
                 names=columns)

# Print the shape of dataframe


print(df.shape)

# Print top 5 rows


df.head(5)

(74682, 4)

[2]: ID Entity Sentiment \


0 2401 Borderlands Positive
1 2401 Borderlands Positive
2 2401 Borderlands Positive
3 2401 Borderlands Positive
4 2401 Borderlands Positive

Tweet_content
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …

[3]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681

Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 74682 non-null int64
1 Entity 74682 non-null object
2 Sentiment 74682 non-null object
3 Tweet_content 73996 non-null object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB

[4]: # Check the distribution of Sentiment


df['Sentiment'].value_counts()

[4]: Sentiment
Negative 22542
Positive 20832
Neutral 18318
Irrelevant 12990
Name: count, dtype: int64

[5]: # Show sample


for i in range(5):
print(f"{i+1}: {df['Tweet_content'][i]} -> {df['Sentiment'][i]}")

1: im getting on borderlands and i will murder you all , -> Positive


2: I am coming to the borders and I will kill you all, -> Positive
3: im getting on borderlands and i will kill you all, -> Positive
4: im coming on borderlands and i will murder you all, -> Positive
5: im getting on borderlands 2 and i will murder you me all, -> Positive

4 Preprocessing
4.0.1 Dealing with missing values

[6]: # check missing values


print(df.isnull().sum())

ID 0
Entity 0
Sentiment 0
Tweet_content 686
dtype: int64
There are 686 samples with no text. As the text is crucial for us, we are going to remove
these samples before both EDA and model fitting.
[7]: # remove missing values
df.dropna(inplace=True)

# check missing values
df.isnull().sum()

[7]: ID 0
Entity 0
Sentiment 0
Tweet_content 0
dtype: int64

4.0.2 Dealing with duplicate values

[8]: # check duplicate values


df.duplicated().sum()

[8]: 2340

[9]: # remove duplicate values


df = df.drop_duplicates()

# check duplicate values
df.duplicated().sum()

[9]: 0

4.0.3 Drop NaN values

[10]: df.dropna(inplace=True)  # redundant here: missing values were already removed in 4.0.1

5 EDA
[11]: # Calculate class counts
class_counts = df['Sentiment'].value_counts().reset_index()
class_counts.columns = ['Class', 'Count']

# Calculate the total number of tweets in df


total_tweets = len(df)

# Calculate the percentage for each class based on the total number of tweets
class_counts['Percentage'] = (class_counts['Count'] / total_tweets) * 100

# Sort the dataframe by count


class_counts = class_counts.sort_values(by='Count', ascending=False)

# Create the pie chart using matplotlib


plt.figure(figsize=(10, 8))

plt.pie(class_counts['Percentage'], labels=class_counts['Class'],
        autopct='%1.1f%%', startangle=140)

plt.title('Proportions of target classes')


plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

[12]: data = df.groupby(by=["Entity", "Sentiment"]).count().reset_index()
data.head()

[12]: Entity Sentiment ID Tweet_content


0 Amazon Irrelevant 185 185
1 Amazon Negative 565 565
2 Amazon Neutral 1197 1197
3 Amazon Positive 302 302
4 ApexLegends Irrelevant 185 185

[13]: #Figure of comparison per Entity


plt.figure(figsize=(20,6))

sns.barplot(data=data,x="Entity",y="ID",hue='Sentiment')
plt.xticks(rotation=90)
plt.xlabel("Entity")
plt.ylabel("Number of tweets")
plt.grid()
plt.title("Distribution of tweets per Entity")
plt.show()

[14]: plt.figure(figsize=(10, 6))


count_table = pd.crosstab(index=df['Entity'], columns=df['Sentiment'])
sns.heatmap(count_table, cmap='YlOrRd', annot=True, fmt='d',
            linewidths=0.5, linecolor='black')

plt.title('Sentiment Distribution by Entity')


plt.xlabel('Sentiment')
plt.ylabel('Entity')
plt.show()

[15]: # Convert entity names to a single string
entities_text = ' '.join(count_table.index)

# Create word cloud


wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(entities_text)

# Plot word cloud


plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Entities')
plt.axis('off')
plt.show()

[16]: # Concatenate all tweets into a single string
all_tweets_text = ' '.join(df['Tweet_content'])

# Create word cloud


wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(all_tweets_text)

# Plot word cloud


plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Tweets')
plt.axis('off')
plt.show()

6 Preprocess Function for Model
[17]: # load the English language model and create an nlp object from it
nlp = spacy.load("en_core_web_sm")

[18]: # use this utility function to get the preprocessed text data
def preprocess(text):
# remove stop words and lemmatize the text
doc = nlp(text)
filtered_tokens = []
for token in doc:
if token.is_stop or token.is_punct:
continue
filtered_tokens.append(token.lemma_)

return " ".join(filtered_tokens)
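Calling nlp(text) once per row is slow on roughly 70k tweets. Below is a minimal sketch of a
faster batched variant using spaCy's nlp.pipe, assuming the same nlp object and token filtering
as above; the batch_size value and the disabled pipeline components are illustrative choices,
not part of the original notebook:

# Hypothetical batched variant of preprocess(); disabling the parser and NER
# keeps the components lemmatization needs while skipping the rest.
def preprocess_batch(texts):
    results = []
    for doc in nlp.pipe(texts, batch_size=500, disable=["parser", "ner"]):
        tokens = [t.lemma_ for t in doc if not (t.is_stop or t.is_punct)]
        results.append(" ".join(tokens))
    return results

# Usage: df['Preprocessed Text'] = preprocess_batch(df['Tweet_content'].tolist())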

Apply the preprocess function to the dataframe


[19]: df['Preprocessed Text'] = df['Tweet_content'].apply(preprocess)

[20]: df

[20]: ID Entity Sentiment \


0 2401 Borderlands Positive
1 2401 Borderlands Positive

2 2401 Borderlands Positive
3 2401 Borderlands Positive
4 2401 Borderlands Positive
… … … …
74677 9200 Nvidia Positive
74678 9200 Nvidia Positive
74679 9200 Nvidia Positive
74680 9200 Nvidia Positive
74681 9200 Nvidia Positive

Tweet_content \
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …
… …
74677 Just realized that the Windows partition of my…
74678 Just realized that my Mac window partition is …
74679 Just realized the windows partition of my Mac …
74680 Just realized between the windows partition of…
74681 Just like the windows partition of my Mac is l…

Preprocessed Text
0 m get borderland murder
1 come border kill
2 m get borderland kill
3 m come borderland murder
4 m get borderland 2 murder
… …
74677 realize Windows partition Mac like 6 year Nvid…
74678 realize Mac window partition 6 year Nvidia dri…
74679 realize window partition Mac 6 year Nvidia dri…
74680 realize window partition Mac like 6 year Nvidi…
74681 like window partition Mac like 6 year driver i…

[71656 rows x 5 columns]

Encoding target column


[21]: le_model = LabelEncoder()
df['Sentiment'] = le_model.fit_transform(df['Sentiment'])

[22]: df.head(5)

[22]: ID Entity Sentiment \


0 2401 Borderlands 3

1 2401 Borderlands 3
2 2401 Borderlands 3
3 2401 Borderlands 3
4 2401 Borderlands 3

Tweet_content \
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …

Preprocessed Text
0 m get borderland murder
1 come border kill
2 m get borderland kill
3 m come borderland murder
4 m get borderland 2 murder

Split data into train and test


[23]: X_train, X_test, y_train, y_test = train_test_split(
          df['Preprocessed Text'], df['Sentiment'],
          test_size=0.2, random_state=42, stratify=df['Sentiment'])

[24]: print("Shape of X_train: ", X_train.shape)


print("Shape of X_test: ", X_test.shape)

Shape of X_train: (57324,)


Shape of X_test: (14332,)

7 Machine Learning Models


Two models are used: Multinomial Naive Bayes and Random Forest Classifier.

Metrics selection:
• Accuracy
• Classification report
• Confusion matrix (a sketch is shown after the Random Forest report below)

Stopping condition when training the models:
• Multinomial Naive Bayes: training stops once the fit method completes, having processed all
training samples.
• Random Forest Classifier: training stops once the fit method completes, having trained the
specified number of trees (n_estimators) and respected any other stopping criteria
(e.g. max_depth, min_samples_split, min_samples_leaf); see the sketch after this list.
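For illustration, those Random Forest stopping criteria can be set explicitly; the values below
are scikit-learn's defaults (plus an assumed random_state), not a tuned configuration:

# Illustrative configuration of the stopping criteria named above.
# n_estimators=100, min_samples_split=2, min_samples_leaf=1 are sklearn defaults;
# max_depth=None lets each tree grow until its leaves are pure.
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=42)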

7.0.1 Naive Bayes Model

[25]: # Create classifier


# A pipeline is created to streamline the preprocessing and model training steps.

clf = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer()),
    ('naive_bayes', MultinomialNB())
])
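By default TfidfVectorizer extracts unigrams only. If bigrams and trigrams are wanted, which can
help sentiment tasks where negation matters ("not good"), the vectorizer accepts an ngram_range
parameter. This is an illustrative alternative, not the configuration whose scores are reported
below:

# Illustrative alternative: include uni-, bi- and tri-grams.
# min_df=2 (drop n-grams that appear in only one document) is an assumed value.
clf_ngrams = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(ngram_range=(1, 3), min_df=2)),
    ('naive_bayes', MultinomialNB())
])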

[26]: # Model training using the training data


clf.fit(X_train, y_train)

[26]: Pipeline(steps=[('tfidf_vectorizer', TfidfVectorizer()),
                      ('naive_bayes', MultinomialNB())])

[27]: # Get prediction


y_pred = clf.predict(X_test)

[28]: # Print score


print(accuracy_score(y_test, y_pred))

0.7229277142059727

[29]: # Print classification report


print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.94 0.44 0.60 2507


1 0.64 0.90 0.75 4340
2 0.84 0.64 0.73 3542
3 0.71 0.79 0.74 3943

accuracy 0.72 14332


macro avg 0.78 0.69 0.70 14332
weighted avg 0.76 0.72 0.72 14332

7.0.2 Random Forest


[30]: clf = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])

[31]: clf.fit(X_train, y_train)

[31]: Pipeline(steps=[('tfidf_vectorizer', TfidfVectorizer()),
                      ('random_forest', RandomForestClassifier())])

[32]: # Get the predictions for X_test and store it in y_pred


y_pred = clf.predict(X_test)

[33]: # Print Accuracy


print(accuracy_score(y_test, y_pred))

0.9108289143176109

[34]: # Print the classification report


print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.97 0.84 0.90 2507


1 0.93 0.93 0.93 4340
2 0.94 0.90 0.92 3542
3 0.85 0.94 0.89 3943

accuracy 0.91 14332


macro avg 0.92 0.90 0.91 14332
weighted avg 0.91 0.91 0.91 14332
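The metrics list in section 7 mentions a confusion matrix, but none is plotted in the notebook.
A minimal sketch using the Random Forest predictions above (y_test, y_pred):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes,
# in label-encoded order: Irrelevant, Negative, Neutral, Positive.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le_model.classes_, yticklabels=le_model.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Random Forest Confusion Matrix')
plt.show()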

Based on the displayed results, both models perform well, and the Random Forest Classifier is
clearly superior: its overall accuracy is 0.91, with per-class precision, recall and F1-scores
of roughly 0.85-0.97.
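joblib is imported at the top of the notebook but never used; a minimal sketch of how the
fitted pipeline and label encoder could be persisted and reloaded (the file names are
illustrative):

# Persist the fitted pipeline and the label encoder (illustrative file names).
joblib.dump(clf, 'sentiment_pipeline.joblib')
joblib.dump(le_model, 'label_encoder.joblib')

# Later, reload them without retraining:
clf = joblib.load('sentiment_pipeline.joblib')
le_model = joblib.load('label_encoder.joblib')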

8 Test Model
Get text
[35]: test_df = pd.read_csv('/kaggle/input/twitter-entity-sentiment-analysis/twitter_validation.csv',
                            names=columns)

test_df.head()

[35]: ID Entity Sentiment \


0 3364 Facebook Irrelevant
1 352 Amazon Neutral
2 8312 Microsoft Negative
3 4371 CS-GO Negative
4 4433 Google Neutral

Tweet_content
0 I mentioned on Facebook that I was struggling …
1 BBC News - Amazon boss Jeff Bezos rejects clai…

2 @Microsoft Why do I pay for WORD when it funct…
3 CSGO matchmaking is so full of closet hacking,…
4 Now the President is slapping Americans in the…

[36]: test_text = test_df['Tweet_content'][10]


print(f"{test_text} ===> {test_df['Sentiment'][10]}")

The professional dota 2 scene is fucking exploding and I completely welcome it.

Get the garbage out. ===> Positive


Apply preprocess
[37]: test_text_processed = [preprocess(test_text)]
test_text_processed

[37]: ['professional dota 2 scene fucking explode completely welcome \n\n garbage']

Get Prediction
[38]: test_pred = clf.predict(test_text_processed)  # clf is the Random Forest pipeline here

Output
[39]: classes = ['Irrelevant', 'Negative', 'Neutral', 'Positive']

print(f"True Sentiment: {test_df['Sentiment'][10]}")


print(f"Predicted Sentiment: {classes[test_pred[0]]}")

True Sentiment: Positive


Predicted Sentiment: Positive
Irrelevant: 0, Negative: 1, Neutral: 2, Positive: 3
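Rather than hard-coding the class list, the mapping can be recovered from the fitted encoder;
LabelEncoder assigns integer codes in alphabetical order of the class names:

# Recover the label mapping from the fitted encoder.
mapping = {cls: int(code) for cls, code in
           zip(le_model.classes_, le_model.transform(le_model.classes_))}
print(mapping)  # {'Irrelevant': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3}

# Or decode predictions directly, without a hand-written class list:
print(le_model.inverse_transform(test_pred))  # ['Positive']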

9 Applicability of research results


The results of this sentiment analysis and other machine learning models can be applied in
various future scenarios.
• Personalized Marketing: Use sentiment analysis to tailor marketing strategies based on
customer sentiments. Personalized marketing campaigns can improve customer engagement and
conversion rates by targeting customers with content that resonates with their sentiments
and preferences.
• Product Development and Improvement: Analyzing customer feedback on new product features
or updates helps product development teams understand which features are well-received and
which need improvement, leading to better product iterations and increased user satisfaction.
• Trend Analysis: Monitor sentiment trends over time to gauge public opinion on various
topics, products, or brands. This provides insights into market trends and consumer
preferences, allowing businesses to stay ahead of the competition and make informed
strategic decisions.
