
twitter-sentiment-analysis-dss

June 27, 2024

1 Introduction
1. Input: an entity-level Twitter sentiment analysis dataset from Kaggle.
2. Main task: judge the sentiment of each message towards the entity it mentions. There are
three classes in this dataset: Positive, Negative and Neutral. We regard messages that are
not relevant to the entity (i.e. Irrelevant) as Neutral, then build a model that can
automatically identify emotional states (e.g. anger, joy) that people express about a
company's product on Twitter.
3. Output: the predicted sentiment of each message: Positive, Negative, Neutral or Irrelevant.
4. Method: text classification. This is one of the most common tasks in NLP. It can be used
for a wide range of applications (e.g. tagging customer feedback into categories, routing
support tickets according to their language). Another common text classification problem is
sentiment analysis, which aims to identify the polarity of a given text (positive/negative).
5. Processing requirements:
• Data Loading and Cleaning: Load the twitter_training dataset and perform data cleaning to
handle missing values and inconsistencies.
• Exploratory Data Analysis (EDA): Use visualizations to understand data distributions and
key patterns.
• Feature Engineering: Create new features or transform existing ones to improve predictive
modeling.
• Model Building and Evaluation: Train machine learning models to predict the sentiment
classes: Positive, Negative, Neutral and Irrelevant.

2 Import Needed Modules


[1]: import pandas as pd
import numpy as np
import re
import string
import nltk
import joblib
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from PIL import Image

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

from sklearn.model_selection import train_test_split


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

3 Loading the data


[2]: # Read the dataset "twitter_training.csv" and store it in a variable df

columns = ['ID', 'Entity', 'Sentiment', 'Tweet_content']
df = pd.read_csv("/kaggle/input/twitter-entity-sentiment-analysis/twitter_training.csv",
                 names=columns)

# Print the shape of dataframe


print(df.shape)

# Print top 5 rows


df.head(5)

(74682, 4)

[2]: ID Entity Sentiment \


0 2401 Borderlands Positive
1 2401 Borderlands Positive
2 2401 Borderlands Positive
3 2401 Borderlands Positive
4 2401 Borderlands Positive

Tweet_content
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …

[3]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681

Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 74682 non-null int64
1 Entity 74682 non-null object
2 Sentiment 74682 non-null object
3 Tweet_content 73996 non-null object
dtypes: int64(1), object(3)
memory usage: 2.3+ MB

[4]: # Check the distribution of Sentiment


df['Sentiment'].value_counts()

[4]: Sentiment
Negative 22542
Positive 20832
Neutral 18318
Irrelevant 12990
Name: count, dtype: int64

[5]: # Show sample


for i in range(5):
print(f"{i+1}: {df['Tweet_content'][i]} -> {df['Sentiment'][i]}")

1: im getting on borderlands and i will murder you all , -> Positive


2: I am coming to the borders and I will kill you all, -> Positive
3: im getting on borderlands and i will kill you all, -> Positive
4: im coming on borderlands and i will murder you all, -> Positive
5: im getting on borderlands 2 and i will murder you me all, -> Positive

4 Preprocessing
4.0.1 Dealing with missing values

[6]: # check missing values


print(df.isnull().sum())

ID 0
Entity 0
Sentiment 0
Tweet_content 686
dtype: int64
There are 686 samples with no text. As the text is crucial for us, we are going to remove
these samples before both EDA and model fitting.
[7]: # remove missing values
df.dropna(inplace=True)

# check missing values
df.isnull().sum()

[7]: ID 0
Entity 0
Sentiment 0
Tweet_content 0
dtype: int64

4.0.2 Dealing with duplicate values

[8]: # check duplicate values


df.duplicated().sum()

[8]: 2340

[9]: # remove duplicate values


df = df.drop_duplicates()

# check duplicate values
df.duplicated().sum()

[9]: 0

4.0.3 Drop NaN values

[10]: df.dropna(inplace=True)  # redundant here: missing values were already removed in 4.0.1

5 EDA
[11]: # Calculate class counts
class_counts = df['Sentiment'].value_counts().reset_index()
class_counts.columns = ['Class', 'Count']

# Calculate the total number of tweets in df


total_tweets = len(df)

# Calculate the percentage for each class based on the total number of tweets
class_counts['Percentage'] = (class_counts['Count'] / total_tweets) * 100

# Sort the dataframe by count


class_counts = class_counts.sort_values(by='Count', ascending=False)

# Create the pie chart using matplotlib


plt.figure(figsize=(10, 8))

plt.pie(class_counts['Percentage'], labels=class_counts['Class'],
        autopct='%1.1f%%', startangle=140)

plt.title('Proportions of target classes')


plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()

[12]: data = df.groupby(by=["Entity", "Sentiment"]).count().reset_index()
data.head()

[12]: Entity Sentiment ID Tweet_content


0 Amazon Irrelevant 185 185
1 Amazon Negative 565 565
2 Amazon Neutral 1197 1197
3 Amazon Positive 302 302
4 ApexLegends Irrelevant 185 185

[13]: #Figure of comparison per Entity


plt.figure(figsize=(20,6))

sns.barplot(data=data,x="Entity",y="ID",hue='Sentiment')
plt.xticks(rotation=90)
plt.xlabel("Entity")
plt.ylabel("Number of tweets")
plt.grid()
plt.title("Distribution of tweets per Entity")
plt.show()

[14]: plt.figure(figsize=(10, 6))


count_table = pd.crosstab(index=df['Entity'], columns=df['Sentiment'])
sns.heatmap(count_table, cmap='YlOrRd', annot=True, fmt='d',
            linewidths=0.5, linecolor='black')

plt.title('Sentiment Distribution by Entity')


plt.xlabel('Sentiment')
plt.ylabel('Entity')
plt.show()

[15]: # Convert entity names to a single string
entities_text = ' '.join(count_table.index)

# Create word cloud


wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(entities_text)

# Plot word cloud


plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Entities')
plt.axis('off')
plt.show()

[16]: # Concatenate all tweets into a single string
all_tweets_text = ' '.join(df['Tweet_content'])

# Create word cloud


wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate(all_tweets_text)

# Plot word cloud


plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Word Cloud of Tweets')
plt.axis('off')
plt.show()

6 Preprocess Function for Model
[17]: # load the English language model and create an nlp object from it
nlp = spacy.load("en_core_web_sm")

[18]: # use this utility function to get the preprocessed text data
def preprocess(text):
# remove stop words and lemmatize the text
doc = nlp(text)
filtered_tokens = []
for token in doc:
if token.is_stop or token.is_punct:
continue
filtered_tokens.append(token.lemma_)

return " ".join(filtered_tokens)
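Calling nlp(text) once per row is slow on roughly 70k tweets. Below is a minimal sketch of a
faster batched variant using spaCy's nlp.pipe, assuming the same nlp object and token filtering
as above; the batch_size value and the disabled pipeline components are illustrative choices,
not part of the original notebook:

# Hypothetical batched variant of preprocess(); disabling the parser and NER
# keeps the components lemmatization needs while skipping the rest.
def preprocess_batch(texts):
    results = []
    for doc in nlp.pipe(texts, batch_size=500, disable=["parser", "ner"]):
        tokens = [t.lemma_ for t in doc if not (t.is_stop or t.is_punct)]
        results.append(" ".join(tokens))
    return results

# Usage: df['Preprocessed Text'] = preprocess_batch(df['Tweet_content'].tolist())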

Apply the preprocess function to the dataframe


[19]: df['Preprocessed Text'] = df['Tweet_content'].apply(preprocess)

[20]: df

[20]: ID Entity Sentiment \


0 2401 Borderlands Positive
1 2401 Borderlands Positive

2 2401 Borderlands Positive
3 2401 Borderlands Positive
4 2401 Borderlands Positive
… … … …
74677 9200 Nvidia Positive
74678 9200 Nvidia Positive
74679 9200 Nvidia Positive
74680 9200 Nvidia Positive
74681 9200 Nvidia Positive

Tweet_content \
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …
… …
74677 Just realized that the Windows partition of my…
74678 Just realized that my Mac window partition is …
74679 Just realized the windows partition of my Mac …
74680 Just realized between the windows partition of…
74681 Just like the windows partition of my Mac is l…

Preprocessed Text
0 m get borderland murder
1 come border kill
2 m get borderland kill
3 m come borderland murder
4 m get borderland 2 murder
… …
74677 realize Windows partition Mac like 6 year Nvid…
74678 realize Mac window partition 6 year Nvidia dri…
74679 realize window partition Mac 6 year Nvidia dri…
74680 realize window partition Mac like 6 year Nvidi…
74681 like window partition Mac like 6 year driver i…

[71656 rows x 5 columns]

Encoding target column


[21]: le_model = LabelEncoder()
df['Sentiment'] = le_model.fit_transform(df['Sentiment'])

[22]: df.head(5)

[22]: ID Entity Sentiment \


0 2401 Borderlands 3

1 2401 Borderlands 3
2 2401 Borderlands 3
3 2401 Borderlands 3
4 2401 Borderlands 3

Tweet_content \
0 im getting on borderlands and i will murder yo…
1 I am coming to the borders and I will kill you…
2 im getting on borderlands and i will kill you …
3 im coming on borderlands and i will murder you…
4 im getting on borderlands 2 and i will murder …

Preprocessed Text
0 m get borderland murder
1 come border kill
2 m get borderland kill
3 m come borderland murder
4 m get borderland 2 murder

Split data into train and test


[23]: X_train, X_test, y_train, y_test = train_test_split(
          df['Preprocessed Text'], df['Sentiment'],
          test_size=0.2, random_state=42, stratify=df['Sentiment'])

[24]: print("Shape of X_train: ", X_train.shape)


print("Shape of X_test: ", X_test.shape)

Shape of X_train: (57324,)


Shape of X_test: (14332,)

7 Machine Learning Models


Two models are used: Multinomial Naive Bayes and Random Forest Classifier.

Metrics selection:
• Accuracy
• Classification report
• Confusion matrix (a sketch is shown after the Random Forest report below)

Stopping condition when training the models:
• Multinomial Naive Bayes: training stops once the fit method completes, having processed all
training samples.
• Random Forest Classifier: training stops once the fit method completes, having trained the
specified number of trees (n_estimators) and respected any other stopping criteria
(e.g. max_depth, min_samples_split, min_samples_leaf); see the sketch after this list.
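For illustration, those Random Forest stopping criteria can be set explicitly; the values below
are scikit-learn's defaults (plus an assumed random_state), not a tuned configuration:

# Illustrative configuration of the stopping criteria named above.
# n_estimators=100, min_samples_split=2, min_samples_leaf=1 are sklearn defaults;
# max_depth=None lets each tree grow until its leaves are pure.
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=42)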

7.0.1 Naive Bayes Model

[25]: # Create classifier


# A pipeline is created to streamline the preprocessing and model training steps.

clf = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer()),
    ('naive_bayes', MultinomialNB())
])
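By default TfidfVectorizer extracts unigrams only. If bigrams and trigrams are wanted, which can
help sentiment tasks where negation matters ("not good"), the vectorizer accepts an ngram_range
parameter. This is an illustrative alternative, not the configuration whose scores are reported
below:

# Illustrative alternative: include uni-, bi- and tri-grams.
# min_df=2 (drop n-grams that appear in only one document) is an assumed value.
clf_ngrams = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer(ngram_range=(1, 3), min_df=2)),
    ('naive_bayes', MultinomialNB())
])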

[26]: # Model training using the training data


clf.fit(X_train, y_train)

[26]: Pipeline(steps=[('tfidf_vectorizer', TfidfVectorizer()),
                      ('naive_bayes', MultinomialNB())])

[27]: # Get prediction


y_pred = clf.predict(X_test)

[28]: # Print score


print(accuracy_score(y_test, y_pred))

0.7229277142059727

[29]: # Print classification report


print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.94 0.44 0.60 2507


1 0.64 0.90 0.75 4340
2 0.84 0.64 0.73 3542
3 0.71 0.79 0.74 3943

accuracy 0.72 14332


macro avg 0.78 0.69 0.70 14332
weighted avg 0.76 0.72 0.72 14332

7.0.2 Random Forest


[30]: clf = Pipeline([
    ('tfidf_vectorizer', TfidfVectorizer()),
    ('random_forest', RandomForestClassifier())
])

[31]: clf.fit(X_train, y_train)

[31]: Pipeline(steps=[('tfidf_vectorizer', TfidfVectorizer()),
                      ('random_forest', RandomForestClassifier())])

[32]: # Get the predictions for X_test and store it in y_pred


y_pred = clf.predict(X_test)

[33]: # Print Accuracy


print(accuracy_score(y_test, y_pred))

0.9108289143176109

[34]: # Print the classification report


print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.97 0.84 0.90 2507


1 0.93 0.93 0.93 4340
2 0.94 0.90 0.92 3542
3 0.85 0.94 0.89 3943

accuracy 0.91 14332


macro avg 0.92 0.90 0.91 14332
weighted avg 0.91 0.91 0.91 14332
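The metrics list in section 7 mentions a confusion matrix, but none is plotted in the notebook.
A minimal sketch using the Random Forest predictions above (y_test, y_pred):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes,
# in label-encoded order: Irrelevant, Negative, Neutral, Positive.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le_model.classes_, yticklabels=le_model.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Random Forest Confusion Matrix')
plt.show()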

Based on the displayed results, both models perform well, and the Random Forest Classifier is
clearly superior: its overall accuracy is 0.91, with per-class precision, recall and F1-scores
of roughly 0.85-0.97.
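joblib is imported at the top of the notebook but never used; a minimal sketch of how the
fitted pipeline and label encoder could be persisted and reloaded (the file names are
illustrative):

# Persist the fitted pipeline and the label encoder (illustrative file names).
joblib.dump(clf, 'sentiment_pipeline.joblib')
joblib.dump(le_model, 'label_encoder.joblib')

# Later, reload them without retraining:
clf = joblib.load('sentiment_pipeline.joblib')
le_model = joblib.load('label_encoder.joblib')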

8 Test Model
Get text
[35]: test_df = pd.read_csv('/kaggle/input/twitter-entity-sentiment-analysis/twitter_validation.csv',
                            names=columns)

test_df.head()

[35]: ID Entity Sentiment \


0 3364 Facebook Irrelevant
1 352 Amazon Neutral
2 8312 Microsoft Negative
3 4371 CS-GO Negative
4 4433 Google Neutral

Tweet_content
0 I mentioned on Facebook that I was struggling …
1 BBC News - Amazon boss Jeff Bezos rejects clai…

2 @Microsoft Why do I pay for WORD when it funct…
3 CSGO matchmaking is so full of closet hacking,…
4 Now the President is slapping Americans in the…

[36]: test_text = test_df['Tweet_content'][10]


print(f"{test_text} ===> {test_df['Sentiment'][10]}")

The professional dota 2 scene is fucking exploding and I completely welcome it.

Get the garbage out. ===> Positive


Apply preprocess
[37]: test_text_processed = [preprocess(test_text)]
test_text_processed

[37]: ['professional dota 2 scene fucking explode completely welcome \n\n garbage']

Get Prediction
[38]: test_pred = clf.predict(test_text_processed)  # clf is the Random Forest pipeline here

Output
[39]: classes = ['Irrelevant', 'Negative', 'Neutral', 'Positive']

print(f"True Sentiment: {test_df['Sentiment'][10]}")


print(f"Predicted Sentiment: {classes[test_pred[0]]}")

True Sentiment: Positive


Predicted Sentiment: Positive
Irrelevant: 0, Negative: 1, Neutral: 2, Positive: 3
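Rather than hard-coding the class list, the mapping can be recovered from the fitted encoder;
LabelEncoder assigns integer codes in alphabetical order of the class names:

# Recover the label mapping from the fitted encoder.
mapping = {cls: int(code) for cls, code in
           zip(le_model.classes_, le_model.transform(le_model.classes_))}
print(mapping)  # {'Irrelevant': 0, 'Negative': 1, 'Neutral': 2, 'Positive': 3}

# Or decode predictions directly, without a hand-written class list:
print(le_model.inverse_transform(test_pred))  # ['Positive']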

9 Applicability of research results


The results of this sentiment analysis and other machine learning models can be applied in
various future scenarios.
• Personalized Marketing: Use sentiment analysis to tailor marketing strategies based on
customer sentiments. Personalized marketing campaigns can improve customer engagement and
conversion rates by targeting customers with content that resonates with their sentiments
and preferences.
• Product Development and Improvement: Analyzing customer feedback on new product features
or updates helps product development teams understand which features are well-received and
which need improvement, leading to better product iterations and increased user satisfaction.
• Trend Analysis: Monitor sentiment trends over time to gauge public opinion on various
topics, products, or brands. This provides insights into market trends and consumer
preferences, allowing businesses to stay ahead of the competition and make informed
strategic decisions.
