AI_Phase3


PHASE 3

SENTIMENT
ANALYSIS
FOR MARKETING

SOUNDHAR BALAJI.B

RADHIKA.M

ROHANSHAJ.K.R
VIGNESH.V

NELSON JOSEPH.M
Introduction:
 In the realm of natural language processing and sentiment analysis, the
journey to extract meaningful insights from text data begins with the
critical steps of loading and preprocessing the dataset. These initial
stages form the foundation on which the entire sentiment analysis solution
is built.

 Loading the dataset is the act of retrieving the raw text that will be the
lifeblood of the analysis. The source can be diverse: social media posts,
customer reviews, or any other corpus of text that carries the sentiment of
interest.

 Raw text data, however, is rarely ready for analysis in its original form.
Preprocessing is the transformation that makes the data usable by machine
learning and natural language processing algorithms. It involves steps such
as text cleaning, tokenization, stop-word removal, stemming, and
lemmatization, which standardize the data and remove noise, improving the
quality of the insights that can be derived.
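
A minimal sketch of these standard steps using NLTK (the sample sentence and
the resource downloads are illustrative only; the notebook later in this
report uses its own process_tweet function instead):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer, stop-word, and WordNet resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The flight was delayed AGAIN... really disappointed with the service!!!"

# Text cleaning: lowercase and drop everything except letters and spaces
cleaned = re.sub(r"[^a-z\s]", "", text.lower())

# Tokenization: split the cleaned string into individual words
tokens = word_tokenize(cleaned)

# Stop-word removal: drop very common words that carry little sentiment
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop_words]

# Stemming: cut words down to a crude root form (e.g. "delayed" -> "delay")
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Lemmatization: map words to a dictionary form instead of a stem
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]

print(stems)
print(lemmas)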

 In this part of the project, we take on the crucial tasks of loading the
dataset, understanding its structure, and carrying out the necessary
preprocessing steps. This groundwork sets the stage for the subsequent
phases: feature engineering, model development, and sentiment analysis.
With a well-prepared dataset, the work of understanding and harnessing the
sentiment in the text can begin.

TASK:
Phase 3: Development Part 1

In this part you will begin building your project by loading and preprocessing the dataset.
Start building the sentiment analysis solution by loading the dataset and
preprocessing the data.

DATASET: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
NOTEBOOK LINK: https://drive.google.com/drive/folders/1G6Gqw6_E7Cs8dfOrto3jI4_eXOZGiDt3
In [20]: pip install nltk
Requirement already satisfied: nltk in
c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (3.8.1)
Requirement already satisfied: joblib in
c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (1.3.2)
Requirement already satisfied: tqdm in
c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (4.66.1)
Requirement already satisfied: regex>=2021.8.3 in
c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (2023.10.3)
Requirement already satisfied: click in
c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (8.1.7)
Requirement already satisfied: colorama in
c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from click->nltk) (0.4.6)

In [21]: import numpy as np
import pandas as pd
import re
import emoji
import tensorflow as tf
from nltk.stem import PorterStemmer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

In [22]: data = pd.read_csv(r"C:\Users\sound\Downloads\Tweets.csv")

In [42]: data.head()

Out[42]: (first five rows of the dataset)

In [23]: data
Preprocessing
confidence_threshold = 0.6

# Drop rows where the crowdsourced sentiment label is low-confidence
data = data.drop(data.query("airline_sentiment_confidence < @confidence_threshold").index,
                 axis=0).reset_index(drop=True)

# Keep only the tweet text and its sentiment label
tweets_df = pd.concat([data['text'], data['airline_sentiment']], axis=1)

tweets_df

tweets_df['airline_sentiment'].value_counts()

# Check for missing values
tweets_df.isna().sum().sum()

# Encode the labels as integers: 0 = negative, 1 = neutral, 2 = positive
sentiment_ordering = ['negative', 'neutral', 'positive']
tweets_df['airline_sentiment'] = tweets_df['airline_sentiment'].apply(lambda x: sentiment_ordering.index(x))

tweets_df

emoji.demojize('@AmericanAir right on cue with the delays 👌')

'@AmericanAir right on cue with the delays :OK_hand:'

ps = PorterStemmer()

def process_tweet(tweet):
    new_tweet = tweet.lower()
    new_tweet = re.sub(r'@\w+', '', new_tweet)                # Remove @mentions
    new_tweet = re.sub(r'#', '', new_tweet)                   # Remove hashtag symbols
    new_tweet = re.sub(r':', ' ', emoji.demojize(new_tweet))  # Turn emojis into words
    new_tweet = re.sub(r'http\S+', '', new_tweet)             # Remove URLs
    new_tweet = re.sub(r'\$\S+', 'dollar', new_tweet)         # Change dollar amounts to "dollar"
    new_tweet = re.sub(r'[^a-z0-9\s]', '', new_tweet)         # Remove punctuation
    new_tweet = re.sub(r'[0-9]+', 'number', new_tweet)        # Change numeric values to "number"
    new_tweet = new_tweet.split(" ")
    new_tweet = list(map(lambda x: ps.stem(x), new_tweet))    # Stem the words
    new_tweet = list(map(lambda x: x.strip(), new_tweet))     # Strip whitespace from the words
    if '' in new_tweet:
        new_tweet.remove('')
    return new_tweet
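
As a quick sanity check, process_tweet can be applied to a single raw tweet
and the resulting token list inspected; a small, hypothetical example (the
exact tokens depend on the Porter stemmer and the emoji name handling):

# Hypothetical spot check of process_tweet on one raw tweet: the mention
# is stripped, the emoji is expanded to its name, and each word is stemmed.
sample = "@AmericanAir right on cue with the delays 👌"
print(process_tweet(sample))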

tweets = tweets_df['text'].apply(process_tweet)

labels = np.array(tweets_df['airline_sentiment'])
tweets

# Get size of vocabulary
vocabulary = set()
for tweet in tweets:
    for word in tweet:
        if word not in vocabulary:
            vocabulary.add(word)

vocab_length = len(vocabulary)

# Get max length of a sequence
max_seq_length = 0
for tweet in tweets:
    if len(tweet) > max_seq_length:
        max_seq_length = len(tweet)

# Print results
print("Vocab length:", vocab_length)
print("Max sequence length:", max_seq_length)

Vocab length: 11250


Max sequence length: 90
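
The two loops above can also be written more compactly; an equivalent sketch
using a set comprehension and max() over the same tweets Series:

# Equivalent, more compact way to compute the same two statistics
vocab_length = len({word for tweet in tweets for word in tweet})
max_seq_length = max(len(tweet) for tweet in tweets)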

tokenizer = Tokenizer(num_words=vocab_length)
tokenizer.fit_on_texts(tweets)

sequences = tokenizer.texts_to_sequences(tweets)

word_index = tokenizer.word_index

model_inputs = pad_sequences(sequences, maxlen=max_seq_length, padding='post')


model_inputs

model_inputs.shape

(14402, 90)
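
To confirm that tokenization and padding behave as expected, a single padded
row can be decoded back into words; a small sketch (row 0 is chosen
arbitrarily, and index_word is the Keras Tokenizer's id-to-word mapping):

# Decode the first padded sequence back into words; the trailing zeros are
# the post-padding added by pad_sequences and have no word associated.
first_row = model_inputs[0]
decoded = [tokenizer.index_word[idx] for idx in first_row if idx != 0]
print(first_row)
print(decoded)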

X_train, X_test, y_train, y_test = train_test_split(model_inputs, labels, train_size=0.7, random_state=22)
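
Since the value_counts() check earlier typically shows far more negative
tweets than neutral or positive ones in this dataset, one optional variant
is a stratified split, which keeps the class proportions the same in the
train and test sets:

# Optional variant (not used above): stratify on the labels so the
# negative/neutral/positive proportions match in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    model_inputs, labels, train_size=0.7, random_state=22, stratify=labels
)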

Training
embedding_dim = 32

inputs = tf.keras.Input(shape=(max_seq_length,))

embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=embedding_dim,
    input_length=max_seq_length
)(inputs)

# Model A (just a Flatten layer)
flatten = tf.keras.layers.Flatten()(embedding)

# Model B (GRU with a Flatten layer)
gru = tf.keras.layers.GRU(units=embedding_dim)(embedding)
gru_flatten = tf.keras.layers.Flatten()(gru)

# Both A and B are fed into the output
concat = tf.keras.layers.concatenate([flatten, gru_flatten])

outputs = tf.keras.layers.Dense(3, activation='softmax')(concat)

model = tf.keras.Model(inputs, outputs)

tf.keras.utils.plot_model(model)

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])

batch_size = 32
epochs = 100

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,
            restore_best_weights=True,
            verbose=1
        ),
        tf.keras.callbacks.ReduceLROnPlateau()
    ])

Epoch 1/100
252/252 [==============================] - 10s 30ms/step - loss: 0.7891 - accuracy: 0.6664 - val_loss: 0.6665 - val_accuracy: 0.7248 - lr: 0.0010
Epoch 2/100
252/252 [==============================] - 7s 27ms/step - loss: 0.5201 - accuracy: 0.7991 - val_loss: 0.5361 - val_accuracy: 0.7873 - lr: 0.0010
Epoch 3/100
252/252 [==============================] - 7s 27ms/step - loss: 0.3668 - accuracy: 0.8710 - val_loss: 0.5028 - val_accuracy: 0.8002 - lr: 0.0010
Epoch 4/100
252/252 [==============================] - 7s 28ms/step - loss: 0.2703 - accuracy: 0.9117 - val_loss: 0.5048 - val_accuracy: 0.8057 - lr: 0.0010
Epoch 5/100
252/252 [==============================] - 7s 28ms/step - loss: 0.2025 - accuracy: 0.9375 - val_loss: 0.5170 - val_accuracy: 0.8032 - lr: 0.0010
Epoch 6/100
252/252 [==============================] - ETA: 0s - loss: 0.1512 - accuracy: 0.9604
Restoring model weights from the end of the best epoch: 3.
252/252 [==============================] - 7s 28ms/step - loss: 0.1512 - accuracy: 0.9604 - val_loss: 0.5317 - val_accuracy: 0.8012 - lr: 0.0010
Epoch 6: early stopping
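
The history object returned by model.fit records the per-epoch metrics; a
minimal sketch for plotting them (assumes matplotlib is installed):

# Plot training vs. validation loss and accuracy across epochs
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='val accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()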

Results
model.evaluate(X_test, y_test)

136/136 [==============================] - 1s 9ms/step - loss: 0.4885 - accuracy: 0.8051


[0.48851093649864197, 0.8051376938819885]
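
Overall accuracy hides how each class performs; a possible follow-up is a
per-class report on the test set (assumes scikit-learn is installed; the
class order follows sentiment_ordering):

# Per-class precision/recall/F1 and confusion matrix on the held-out test set
from sklearn.metrics import classification_report, confusion_matrix

y_pred = np.argmax(model.predict(X_test), axis=1)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=sentiment_ordering))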
CONCLUSION
In this initial phase of our sentiment analysis project, we've made
significant progress by successfully loading and preprocessing the
dataset. This foundational step is crucial for the success of our entire
project, as the quality and structure of our data will directly impact the
accuracy and reliability of our sentiment analysis models. By loading the
dataset, we've bridged the gap between raw data and actionable insights,
making it accessible for further analysis. Our preprocessing efforts, which
included tasks such as text cleaning, tokenization, and handling missing
values, have improved the data's quality, making it ready for more
advanced natural language processing techniques.

Loading the dataset was more than just a technical task; it marked
the beginning of our journey towards understanding and predicting
sentiment in text. The dataset, composed of airline-related tweets, holds
the potential to reveal valuable insights about customers' opinions,
emotions, and attitudes. By ensuring it is correctly structured
and prepared, we are one step closer to extracting meaningful
information. Our diligent preprocessing work ensures that the data is
consistent and free from common issues that could otherwise lead to
biased or inaccurate results in our sentiment analysis.

As we move forward in this sentiment analysis project, we can
build upon this solid foundation. The loaded and preprocessed dataset
serves as the cornerstone for our data-driven insights, allowing us to
explore different natural language processing techniques, sentiment
analysis algorithms, and model development. With this groundwork in
place, we are now better equipped to delve into the fascinating world of
sentiment analysis and ultimately provide valuable insights that can
inform decision-making, marketing strategies, and much more. Our
commitment to data quality and preprocessing sets the stage for the
success of our sentiment analysis solution.
