AI_Phase3
SENTIMENT ANALYSIS FOR MARKETING
SOUNDHAR BALAJI.B
RADHIKA.M
ROHANSHAJ.K.R
VIGNESH.V
NELSON JOSEPH.M
Introduction:
In the realm of natural language processing and sentiment analysis,
the journey to extract meaningful insights from text data commences
with the critical steps of loading and preprocessing the dataset.
These initial stages serve as the foundation upon which the entire
sentiment analysis solution is built.
However, raw text data is rarely ready for analysis in its original form.
Preprocessing is the transformative process that makes the data
amenable to machine learning and natural language processing
algorithms. It involves a series of steps such as text cleaning,
tokenization, stop-word removal, stemming, and lemmatization. These
steps standardize the data and strip away noise, enhancing the
quality of the insights derived.
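To make the difference between these steps concrete, here is a small sketch using NLTK; the word list is illustrative only:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

words = ['the', 'flights', 'was', 'delayed', 'running']

# Stop-word removal drops high-frequency function words ('the', 'was')
stops = set(stopwords.words('english'))
print([w for w in words if w not in stops])

# Stemming crudely strips suffixes, e.g. 'delayed' -> 'delay'
ps = PorterStemmer()
print([ps.stem(w) for w in words])

# Lemmatization maps words to dictionary forms, e.g. 'was' -> 'be' (treating words as verbs)
wnl = WordNetLemmatizer()
print([wnl.lemmatize(w, pos='v') for w in words])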
In this part of the project, we will delve into the crucial tasks of
loading the dataset, understanding its structure, and undertaking the
necessary preprocessing steps. This groundwork sets the stage for
subsequent phases, including feature engineering, model
development, and sentiment analysis. With a well-prepared dataset,
the journey towards understanding and harnessing sentiment within
the textual data can begin.
TASK:
Phase 3: Development Part 1
In this part you will begin building your project by loading and preprocessing the dataset.
Start building the sentiment analysis solution by loading the dataset and preprocessing the data.
DATASET: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment
NOTEBOOK LINK: https://drive.google.com/drive/folders/1G6Gqw6_E7Cs8dfOrto3jI4_eXOZGiDt3
In [20]: pip install nltk
Requirement already satisfied: nltk in c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (3.8.1)
Requirement already satisfied: joblib in c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (1.3.2)
Requirement already satisfied: tqdm in c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (4.66.1)
Requirement already satisfied: regex>=2021.8.3 in c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (2023.10.3)
Requirement already satisfied: click in c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from nltk) (8.1.7)
Requirement already satisfied: colorama in c:\users\sound\appdata\local\programs\python\python39\lib\site-packages (from click->nltk) (0.4.6)
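The cell that reads the CSV into data is not shown in the notebook, so here is a minimal loading sketch; Tweets.csv is the file name the Kaggle download ships with, but the local path is an assumption:

import pandas as pd

# Read the Kaggle CSV; the path is an assumption about the local setup
data = pd.read_csv('Tweets.csv')
tweets_df = data  # the preprocessing cells below refer to the frame as tweets_df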
In [42]: data.head()
Out[42]: (first five rows of the airline tweets DataFrame; columns include tweet_id, airline_sentiment, airline_sentiment_confidence, airline, and text)
In [23]: data
Preprocessing
confidence_threshold = 0.6  # minimum annotation confidence for keeping a label

tweets_df['airline_sentiment'].value_counts()  # class distribution across the three labels
tweets_df.isna().sum().sum()  # total number of missing values in the frame
sentiment_ordering = ['negative', 'neutral', 'positive']  # fixed label order for integer encoding
tweets_df
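The confidence_threshold defined above is never applied in the cells shown; presumably it filters out weakly labelled rows. A plausible sketch (the column name comes from the Kaggle schema, and such a filter would explain the smaller row count that appears later):

# Hypothetical filter: keep only tweets whose label meets the confidence threshold
tweets_df = tweets_df[tweets_df['airline_sentiment_confidence'] >= confidence_threshold]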
import re
import emoji
import numpy as np
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def process_tweet(tweet):
    new_tweet = tweet.lower()
    new_tweet = re.sub(r'@\w+', '', new_tweet)                # Remove @mentions
    new_tweet = re.sub(r'#', '', new_tweet)                   # Remove hashtag symbols
    new_tweet = re.sub(r':', ' ', emoji.demojize(new_tweet))  # Turn emojis into words
    new_tweet = re.sub(r'http\S+', '', new_tweet)             # Remove URLs
    new_tweet = re.sub(r'\$\S+', 'dollar', new_tweet)         # Replace dollar amounts with 'dollar'
    new_tweet = re.sub(r'[^a-z0-9\s]', '', new_tweet)         # Remove punctuation
    new_tweet = re.sub(r'[0-9]+', 'number', new_tweet)        # Replace numeric values with 'number'
    new_tweet = new_tweet.split()
    new_tweet = [ps.stem(word) for word in new_tweet]         # Stem each word
    new_tweet = [word.strip() for word in new_tweet]          # Strip stray whitespace
    new_tweet = [word for word in new_tweet if word != '']    # Drop all empty tokens, not just the first
    return new_tweet
tweets = tweets_df['text'].apply(process_tweet)
# Encode the string labels as integers (0=negative, 1=neutral, 2=positive) so they
# work with the sparse_categorical_crossentropy loss used later.
labels = np.array(tweets_df['airline_sentiment'].map(sentiment_ordering.index))
tweets
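To see what the cleaning pipeline produces, here is a hypothetical tweet run through process_tweet; the exact tokens depend on the Porter stemmer:

example = "@united my flight was delayed 3 hours!! https://t.co/xyz"
print(process_tweet(example))
# Roughly: ['my', 'flight', 'wa', 'delay', 'number', 'hour']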
# Neither vocabulary nor max_seq_length is defined in an earlier cell,
# so they are reconstructed here from the processed tweets.
vocabulary = set(word for tweet in tweets for word in tweet)
vocab_length = len(vocabulary)
max_seq_length = max(len(tweet) for tweet in tweets)

# Print results
print("Vocab length:", vocab_length)
print("Max sequence length:", max_seq_length)

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=vocab_length)
tokenizer.fit_on_texts(tweets)
sequences = tokenizer.texts_to_sequences(tweets)
word_index = tokenizer.word_index
# The padding cell is implied by the shape below but not shown in the notebook:
# every sequence is padded to the same length so the model gets a fixed-size input.
from tensorflow.keras.preprocessing.sequence import pad_sequences
model_inputs = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

model_inputs.shape
(14402, 90)
Training
import tensorflow as tf

embedding_dim = 32

inputs = tf.keras.Input(shape=(max_seq_length,))
embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=embedding_dim,
    input_length=max_seq_length
)(inputs)
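The layers between the embedding and the compile call are not shown in the notebook. A minimal sketch of a plausible head, assuming average pooling over the embedded tokens and a small dense classifier (the layer sizes are assumptions, not the notebook's actual architecture):

# Assumed classifier head: pool the embedded sequence, then predict the 3 classes
x = tf.keras.layers.GlobalAveragePooling1D()(embedding)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(3, activation='softmax')(x)  # negative / neutral / positive
model = tf.keras.Model(inputs=inputs, outputs=outputs)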
tf.keras.utils.plot_model(model)
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'])
batch_size = 32
epochs = 100

# The train/test split is not shown in the notebook; a standard hold-out split
# is assumed here so that X_train and X_test below are defined.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    model_inputs, labels, test_size=0.2, random_state=42)

history = model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=3,                  # stop after 3 epochs without improvement
            restore_best_weights=True,   # roll back to the best weights seen
            verbose=1
        ),
        tf.keras.callbacks.ReduceLROnPlateau()  # lower the learning rate when val_loss stalls
    ])
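Plotting the history object is a quick way to check for overfitting before early stopping kicks in; a short sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# Training vs. validation accuracy over the epochs actually run
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()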
Results
model.evaluate(X_test, y_test)
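With the pieces above in place, a new tweet can be scored end to end. A sketch that reuses the fitted tokenizer and the assumed padding step (the example tweet is hypothetical):

def predict_sentiment(raw_tweet):
    # Apply the same cleaning, tokenizing, and padding used for training
    tokens = process_tweet(raw_tweet)
    seq = tokenizer.texts_to_sequences([tokens])
    padded = pad_sequences(seq, maxlen=max_seq_length, padding='post')
    probs = model.predict(padded)[0]
    return sentiment_ordering[int(np.argmax(probs))]

print(predict_sentiment("@united thanks for the smooth, on-time flight!"))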
Loading the dataset was more than a technical task; it marked
the beginning of our journey toward understanding and predicting
sentiment in text. The dataset, comprising tweets directed at major
U.S. airlines, holds the potential to reveal valuable insights about
people's opinions, emotions, and attitudes. By ensuring it is correctly
structured and prepared, we are one step closer to extracting
meaningful information. Our diligent preprocessing work ensures that
the data is consistent and free from the common issues that could
otherwise lead to biased or inaccurate results in our sentiment analysis.