Submitted by

in partial fulfillment of the requirements for the award of the degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

MAY 2024
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE                                    SIGNATURE
Dr. E. GANESH, M.E., Ph.D.                   Dr. I. DIVYA, M.E., Ph.D.
ASSISTANT PROFESSOR                          ASSISTANT PROFESSOR
Department of Computer Science               Department of Computer Science
and Engineering                              and Engineering
Aalim Muhammed Salegh College                Aalim Muhammed Salegh College
of Engineering                               of Engineering
Chennai 600 055.                             Chennai 600 055.
TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
1 INTRODUCTION
  1.1 OBJECTIVE
2 LITERATURE SURVEY
3 MACHINE LEARNING
  3.11 TYPES
  3.21 SUPERVISED LEARNING
    3.211 LINEAR REGRESSION
    3.212 LOGISTIC REGRESSION
    3.213 DECISION TREES
    3.214 SUPPORT VECTOR MACHINES
  3.22 UNSUPERVISED LEARNING
    3.221 K-MEANS CLUSTERING
    3.222 HIERARCHICAL CLUSTERING
4 METHODOLOGY
  4.4 PROPOSED SYSTEMS
  4.52 TOKENIZATION
  4.53 NORMALIZATION
CONCLUSION
FUTURE ENHANCEMENT
REFERENCES
ACKNOWLEDGEMENT
First and foremost, we thank God the Almighty, who is our refuge and strength. We express our heartfelt thanks to our beloved parents, who sacrificed so much for our better future.

We are deeply indebted to our college Founder, Alhaj Dr. S. M. SHAIK NURUDDIN, and Chairperson, Janaba Alhajiyani M. S. HABIBUNNISA, of the Aalim Muhammed Salegh Group of Educational Institutions, and to our Honorable Secretary & Correspondent, Janab Alhaji S. SEGU JAMALUDEEN, Aalim Muhammed Salegh Group of Educational Institutions, for providing the necessary facilities throughout the course.

We take this opportunity to express our deep sense of gratitude to our beloved Principal, Prof. Dr. S. SATHISH, for granting permission to undertake the project.

We thank our project guide, Dr. I. DIVYA, Assistant Professor, Department of Computer Science and Engineering, Aalim Muhammed Salegh College of Engineering, Chennai, who encouraged us to take on this project and never ceased to lend her support.
ABSTRACT
CHAPTER - 1
INTRODUCTION
Understanding the emotional content of textual data is crucial for various
applications, including sentiment analysis, social media monitoring, and
customer feedback analysis. Emotions play a significant role in human
communication and can provide valuable insights into individual and collective
behaviors and attitudes.
The aim of this project is to develop a robust text emotion detection system
capable of accurately analyzing and categorizing emotions in textual data. Our
system will provide users with a user-friendly interface for inputting text and
visualizing the detected emotions along with their corresponding probabilities.
By leveraging pre-trained machine learning models and interactive
visualizations, our system seeks to enhance the understanding of emotional
sentiment in textual data and facilitate informed decision-making based on the
analyzed information.
Beyond these immediate goals, the project contributes to research in natural language processing and sentiment analysis while providing practical tools for understanding and interpreting human emotions in textual data.
1.1 OBJECTIVE
Our goal is to create an efficient and accurate text emotion detection system
that can analyze and categorize emotions expressed in textual data. By leveraging
machine learning models and advanced natural language processing techniques,
we aim to develop a robust framework capable of providing valuable insights into
the emotional sentiment conveyed in various types of text. The system will enable
users to automate the process of emotion analysis, thereby saving time and effort.
Additionally, we aim to ensure the system's scalability and versatility, making it
suitable for a wide range of applications, including sentiment analysis, social
media monitoring, and customer feedback analysis. Through continuous
refinement and improvement, we strive to create a reliable and effective tool for
understanding and interpreting emotional sentiment in textual data.
CHAPTER - 2
LITERATURE SURVEY
Lexicon-Based and Rule-Based Approaches (2003-2012):
“Text-Based Emotion Recognition Using Deep Learning Approach”
(2018) by Hu et al. (ResearchGate): (Not directly from IEEE) This work explores
the application of deep learning for text-based emotion recognition. Deep
learning techniques, such as convolutional neural networks (CNNs) and recurrent
neural networks (RNNs), can handle complex patterns in text data, potentially
leading to more accurate emotion recognition compared to traditional machine
learning methods.
advancements in this area (2020). The paper explores various deep learning
architectures and their effectiveness in emotion recognition tasks.
Additional Resources:
“Emotion Detection from Text” (2023) by Gupta (Kaggle): This Kaggle dataset
provides a resource for researchers working on text-based emotion recognition.
It includes a collection of tweets annotated with the corresponding emotions,
which can be used for training and testing emotion classification models.
CHAPTER - 3
MACHINE LEARNING
General AI (Strong AI):
Superintelligence:
ANI Subtypes:
Reactive Machines: Reactive machines are AI systems that operate based on
predefined rules and patterns, without the ability to form memories or learn from
experience. They excel in specific tasks but lack adaptability or responsiveness
to changing environments.
Limited Memory: Limited memory AI systems incorporate elements of memory
or past experiences to make decisions or predictions. These systems can learn
from historical data but are still limited in their ability to generalize beyond their
training.
Theory of Mind: Theory of Mind AI refers to systems that possess the ability to
understand and infer the mental states, beliefs, and intentions of other agents. This
capability enables more sophisticated interactions and communication between
AI systems and humans.
3.2 MACHINE LEARNING
Machine learning is used in all sorts of places, from suggesting products you
might like on shopping websites to recognizing faces in photos. It's a powerful
tool that's constantly being developed and improved.
1. Input Data (Features): The input data consists of features or attributes that
describe each data point. These features serve as the input to the model and are
used to make predictions.
2. Output Labels: In supervised learning, each data point is associated with a
corresponding output label or target variable. The model learns to predict these
labels based on the input features.
3. Training Dataset: The training dataset contains labeled examples of input-
output pairs used to train the supervised learning model. It is divided into a
training set used for model training and a validation set used for model evaluation
and parameter tuning.
4. Model: The model is the algorithm or function that learns to map input features
to output labels based on the training data. The model parameters are adjusted
during training to minimize the difference between predicted and true labels.
5. Loss Function: The loss function measures the difference between the
predicted and true labels for each data point in the training set. It quantifies the
model's performance and guides the learning process by penalizing prediction
errors.
6. Optimization Algorithm: The optimization algorithm is used to minimize the
loss function and update the model parameters during training. Common
optimization algorithms include gradient descent and its variants.
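To make these components concrete, the following is a minimal, illustrative sketch of the supervised learning workflow using scikit-learn on a built-in toy dataset (this is not the project's code; the dataset and model are chosen purely for demonstration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = load_iris(return_X_y=True)                    # 1-2: input features and output labels
X_train, X_val, y_train, y_val = train_test_split(   # 3: training/validation split
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=500)             # 4: the model
model.fit(X_train, y_train)                          # 6: optimization minimizes the loss

# 5: the loss function quantifies prediction error on held-out data
print("Validation log-loss:",
      log_loss(y_val, model.predict_proba(X_val), labels=model.classes_))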
3.211 Linear Regression:
In linear regression, the relationships are modeled using linear predictor functions
whose unknown model parameters are estimated from the data. Such models are
called linear models.[3] Most commonly, the conditional mean of the response
given the values of the explanatory variables (or predictors) is assumed to be an
affine function of those values; less commonly, the conditional median or some
other quantile is used. Like all forms of regression analysis, linear regression
focuses on the conditional probability distribution of the response given the
values of the predictors, rather than on the joint probability distribution of all of
these variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously,
and to be used extensively in practical applications.[4] This is because models
which depend linearly on their unknown parameters are easier to fit than models
which are non-linearly related to their parameters and because the statistical
properties of the resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
If the goal is error reduction, i.e., variance reduction, in prediction or forecasting, linear
regression can be used to fit a predictive model to an observed data set of values
of the response and explanatory variables. After developing such a model, if
additional values of the explanatory variables are collected without an
accompanying response value, the fitted model can be used to make a prediction
of the response.
If the goal is to explain variation in the response variable that can be attributed to
variation in the explanatory variables, linear regression analysis can be applied to
quantify the strength of the relationship between the response and the explanatory
variables, and in particular to determine whether some explanatory variables may
have no linear relationship with the response at all, or to identify which subsets
of explanatory variables may contain redundant information about the response.
Linear regression models are often fitted using the least squares approach, but
they may also be fitted in other ways, such as by minimizing the "lack of fit" in
some other norm (as with least absolute deviations regression), or by minimizing
a penalized version of the least squares cost function as in ridge regression (L2-
norm penalty) and lasso (L1-norm penalty). Using the Mean Squared Error (MSE) as the cost on a dataset that has many large outliers can result in a model that fits the outliers more than the true data, because MSE assigns higher importance to large errors; cost functions that are robust to outliers should be used in that case. Conversely, the least squares
approach can be used to fit models that are not linear models. Thus, although the
terms "least squares" and "linear model" are closely linked, they are not
synonymous.
The model takes the form

    y = β0 + β1x1 + β2x2 + ... + βnxn + ε

where:
• y is the predicted output (dependent variable).
• x1, x2, ..., xn are the input features (independent variables).
• β0, β1, β2, ..., βn are the coefficients (parameters) of the linear regression model.
• ε represents the error term.

Figure: In linear regression, the observations (red) are assumed to be the result of random deviations (green) from an underlying relationship (blue) between a dependent variable (y) and an independent variable (x).
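As a quick illustration of this model, the sketch below fits the equation above by ordinary least squares with NumPy; the data are synthetic and purely for demonstration:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)   # y = β0 + β1·x + ε

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least squares estimate of (β0, β1)
print(beta)                                      # approximately [2.0, 3.0]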
3.212 Logistic Regression:
This type of statistical model (also known as logit model) is often used for
classification and predictive analytics. Since the outcome is a probability, the
dependent variable is bounded between 0 and 1. In logistic regression, a logit
transformation is applied on the odds—that is, the probability of success divided
by the probability of failure. This is also commonly known as the log odds, or the
natural logarithm of odds, and this logistic function is represented by the
following formulas:
Formula: The logistic regression model uses the logistic function (sigmoid
function) to model the probability of the positive class (class 1):
    P(y = 1 | x) = 1 / (1 + e^-(β0 + β1x1 + β2x2 + ... + βnxn))

Where:
• P(y = 1 | x) is the probability of the positive class given the input features x1, x2, ..., xn.
• β0, β1, β2, ..., βn are the coefficients (parameters) of the logistic regression model.
The logistic function ensures that the predicted probabilities lie between 0 and 1,
making it suitable for binary classification.
This equation is similar to linear regression, where the input values are combined
linearly to predict an output value using weights or coefficient values. However,
unlike linear regression, the output value modeled here is a binary value (0 or 1)
rather than a numeric value.
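The short sketch below shows how the linear score is passed through the logistic (sigmoid) function to produce a probability and a thresholded binary label; the coefficients are made-up values for illustration only:

import numpy as np

def predict_proba(x, beta0, beta):
    z = beta0 + np.dot(beta, x)        # linear score: β0 + β1x1 + ... + βnxn
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid maps the score into (0, 1)

p = predict_proba(np.array([1.2, -0.7]), beta0=0.5, beta=np.array([2.0, 1.0]))
print(p)                               # probability of the positive class
print(1 if p >= 0.5 else 0)            # thresholded class label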
3.213 DECISION TREES:
A decision tree is a non-parametric supervised learning algorithm for classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes, and it yields easy-to-understand models with applications spanning several different areas. As the name suggests, it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits: it starts with a root node and ends with a decision made by the leaves.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as
decision nodes. These nodes represent intermediate decisions or conditions
within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the
final classification or outcome. Leaf nodes are also referred to as terminal nodes.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes
is known as a parent node, and the sub-nodes emerging from it are referred to as
child nodes. The parent node represents a decision or condition, while the child
nodes represent the potential outcomes or further decisions based
on that condition.
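A minimal scikit-learn sketch (on a toy dataset, not the project's data) makes these terms visible: every split in the printed tree is a root or decision node, and every terminal 'class:' line is a leaf node:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # splits are decision nodes; 'class:' lines are leaves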
3.214 SUPPORT VECTOR MACHINES (SVM):
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider two different categories classified using a decision boundary or hyperplane: the dimensions of the hyperplane depend on the number of features present in the dataset. If there are 2 features, the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:
The data points or vectors that lie closest to the hyperplane and affect its position are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
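The sketch below (toy data, illustrative only) fits a linear SVC and inspects the support vectors that define the maximum-margin hyperplane:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, random_state=6)
clf = SVC(kernel='linear').fit(X, y)
print("Support vectors per class:", clf.n_support_)
print("Hyperplane: w =", clf.coef_, ", b =", clf.intercept_)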
3.22 Unsupervised Learning:
3.221 K-MEANS CLUSTERING:
K-means clustering, originating from signal processing, is a technique in vector
quantization. Its objective is to divide a set of n observations into k clusters, with
each observation assigned to the cluster whose mean (cluster center or centroid)
is closest, thereby acting as a representative of that cluster.
K-means is a centroid-based (distance-based) algorithm that tries to minimize the distance of the points in a cluster from their centroid: distances are calculated to assign each point to a cluster, and each cluster is associated with a centroid.
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid. Optimization plays a crucial role here: the goal of the optimization process is to find the set of centroids that minimizes the sum of squared distances between each data point and its closest centroid. The algorithm proceeds as follows:
Initialization: Start by randomly selecting K points from the dataset. These points will act as the initial cluster centroids.

Assignment: For each data point in the dataset, calculate the distance between that point and each of the K centroids. Assign the data point to the cluster whose centroid is closest to it. This step effectively forms K clusters.

Update centroids: Once all data points have been assigned to clusters, recalculate the centroids of the clusters by taking the mean of all data points assigned to each cluster.

Repeat: Repeat the assignment and update steps until convergence. Convergence occurs when the centroids no longer change significantly or when a specified number of iterations is reached.

Final Result: Once convergence is achieved, the algorithm outputs the final cluster centroids and the assignment of each data point to a cluster.
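These steps are implemented, for example, by scikit-learn's KMeans; the brief sketch below (synthetic data, for illustration) runs the initialize-assign-update loop to convergence:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # sum of squared distances to closest centroid
print(km.labels_[:10])       # cluster assignments of the first ten points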
3.222 HIERARCHICAL CLUSTERING:
Hierarchical clustering is a popular method for grouping objects. It creates groups
so that objects within a group are similar to each other and different from objects
in other groups. Clusters are visually represented in a hierarchical tree called a
dendrogram.
Between agglomerative and divisive clustering, agglomerative clustering is generally the preferred method, because it is the most popular and easiest to implement. Its usual steps are:

Step 1: Treat each data point as a single cluster.
Step 2: Compute a measure of the distance (similarity) between every pair of clusters.
Step 3: Merge the clusters based on a metric for the similarity between clusters.
Step 4: Update the distance matrix to reflect the merge.
Step 5: Repeat Step 3 and Step 4 until only a single cluster remains.
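As an illustration (synthetic data, not the project's), SciPy's hierarchical clustering performs these merge steps and draws the dendrogram described above:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.default_rng(0).normal(size=(20, 2))
Z = linkage(X, method='ward')   # agglomerative merges, closest clusters first
dendrogram(Z)                   # hierarchical tree of the merges
plt.title('Dendrogram')
plt.show()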
CHAPTER - 4
METHODOLOGY
3. Storage: Adequate storage space for storing datasets, model files, and other
project-related assets. A minimum of 100 GB of free disk space is recommended.
4. Graphics Processing Unit (GPU): Optional but recommended for training deep
learning models faster. A dedicated GPU with CUDA support, such as NVIDIA
GeForce or Tesla series, is preferred.
6. Streamlit: Installation of the Streamlit library for building and deploying
interactive web applications for the text emotion detection system.
7. Joblib: Installation of the Joblib library for serializing trained machine learning
models for deployment.
Text preprocessing involves removing user handles and stopwords (in this project, via the neattext library). Related approaches in the literature also use transformer-based models like BERT and ensemble methods for text emotion detection. Data cleaning includes:
- Removing special characters, punctuation marks, and HTML tags using regular expressions or built-in string manipulation functions.
- Handling noisy data such as URLs, email addresses, or numerical values that may not contribute to emotion analysis.
- Emoticons and emojis: Decide whether to treat emoticons and emojis as
meaningful symbols or remove them during text cleaning.
4.52 Tokenization:
- Tokenization involves splitting the text into individual words or tokens, which
serve as the basic units for analysis in natural language processing tasks.
4.53 Normalization:
- Emotion intensity: Consider whether capitalization or punctuation can convey
emotional intensity and whether normalization may affect this aspect of the text.
5. Stemming or Lemmatization:
- Stemming and lemmatization aim to reduce words to their base or root form to
improve consistency and reduce dimensionality in the text data.
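A hedged sketch of these preprocessing steps using NLTK follows; the exact libraries and ordering used in the project may differ (the appendix, for instance, uses neattext for handle and stopword removal):

import re
from nltk.tokenize import word_tokenize   # requires nltk.download('punkt') once
from nltk.stem import PorterStemmer

text = "I LOVED the movie!!! Visit https://example.com"
text = re.sub(r"https?://\S+", "", text)      # cleaning: drop URLs
text = re.sub(r"[^A-Za-z\s]", "", text)       # cleaning: drop punctuation/special chars
tokens = word_tokenize(text.lower())          # tokenization + lowercasing (normalization)
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]     # stemming to root forms
print(stems)   # e.g. ['i', 'love', 'the', 'movi', 'visit']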
4.6 Feature Extraction:
- Feature extraction involves transforming the preprocessed textual data into
numerical representations that can be used as input for machine learning
algorithms.
4.61 Feature Extraction Techniques:
1. Bag-of-Words (BoW):
Process:
TF-IDF is a statistical measure that evaluates the importance of a word in a
document relative to its frequency across the entire corpus. It consists of two
components: term frequency (TF) and inverse document frequency (IDF).
Process:
- TF-IDF Calculation: Multiplies the TF and IDF values to obtain the final TF-IDF score.
- Consider the term "cat" appearing 3 times in a document and 100 times in
the corpus of 1000 documents.
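As a worked version of that example, assume the document is 100 words long and use the natural logarithm (both assumptions chosen only for illustration): TF("cat") = 3/100 = 0.03, IDF("cat") = ln(1000/100) = ln 10 ≈ 2.303, so TF-IDF("cat") ≈ 0.03 × 2.303 ≈ 0.069. A rarer term with the same in-document count would receive a larger IDF and hence a higher score.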
3. Word Embeddings:
- Neural Network Training: The word embeddings are learned by training a neural network model on a large corpus of text.
4. N-gram Models:
5. Part-of-Speech (POS) Tagging:
4.62 Feature Extraction Results:
1. Bag-of-Words (BoW):
- Actual Results:
- Analysis:
- The model achieved moderate accuracy, but there is room for improvement
in capturing subtle emotional nuances.
- TF-IDF weighting emphasizes words that are both important in a document
and rare across the corpus, improving the discriminative power of features.
- Actual Results:
- (Detailed precision, recall, and F1-score results for each emotion category)
- Analysis:
3. Word Embeddings:
- Actual Results:
- The Word Embeddings model achieved the highest accuracy of 68.5% on the
test dataset.
- (Detailed precision, recall, and F1-score results for each emotion category)
- Analysis:
4. N-gram Models:
- Actual Results:
- (Detailed precision, recall, and F1-score results for each emotion category)
- Analysis:

5. Part-of-Speech (POS) Tagging:
- Actual Results:
- The POS tagging model achieved an accuracy of 58.9% on the test dataset.
- (Detailed precision, recall, and F1-score results for each emotion category)
- Analysis:
- The model's performance is satisfactory but may benefit from additional
feature engineering or optimization techniques.
4.7 Feature Selection Criteria:
- Information Content: We consider the informativeness of features in
distinguishing between different emotion categories.
- Vocabulary: {"The": 1, "cat": 2, "sat": 3, "on": 4, "the": 5, "mat": 6}
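A quick way to reproduce a vocabulary and count matrix like the one above is scikit-learn's CountVectorizer (a small illustration; lowercase=False keeps "The" and "the" distinct, as in the example, though the library assigns its own column indices):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat"]
cv = CountVectorizer(lowercase=False)
counts = cv.fit_transform(docs)
print(cv.vocabulary_)      # token -> column index mapping
print(counts.toarray())    # per-document token counts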
- TF-IDF Calculation: Multiplies the TF and IDF values to obtain the final TF-
IDF score.
- Consider the term "cat" appearing 3 times in a document and 100 times in
the corpus of 1000 documents.
3. Word Embeddings:
Word embeddings are dense, low-dimensional vector representations of words
learned from large text corpora using neural network models like Word2Vec,
GloVe, or FastText. Each word is mapped to a continuous vector space where
semantically similar words are closer together.
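A brief gensim sketch (assuming gensim is installed; the tiny corpus and parameters are illustrative only) shows words being mapped into a shared vector space:

from gensim.models import Word2Vec

sentences = [["i", "feel", "happy"], ["i", "feel", "sad"],
             ["happy", "and", "joyful"], ["sad", "and", "gloomy"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)
print(model.wv["happy"][:5])                    # dense vector for a word
print(model.wv.similarity("happy", "joyful"))   # cosine similarity of two words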
4. N-gram Models:

5. Part-of-Speech (POS) Tagging:
POS tagging assigns grammatical categories (e.g., noun, verb, adjective) to each word in a text sequence. It provides syntactic information that complements semantic features in text analysis tasks.
- Feature Extraction: Extracts POS tags as features from the text data.
- The training procedure for each selected model included the optimization of hyperparameters and the use of cross-validation techniques to ensure robustness and generalization.
- Each model was trained using the preprocessed text data, which underwent
cleaning, tokenization, normalization, stopword removal, and stemming or
lemmatization.
- Accuracy: 62.0%
- Per-emotion results:

  Emotion     Precision   Recall   F1-score
  Joy         65.5%       70.2%    67.7%
  Sadness     60.1%       65.8%    62.9%
  Fear        58.9%       60.9%    59.9%
  Anger       54.2%       58.3%    56.2%
  Surprise    50.3%       55.1%    52.5%
  Neutral     45.8%       52.6%    49.1%
  Disgust     40.7%       48.9%    44.7%
  Shame       37.2%       46.3%    41.6%
- Accuracy: 62.2%
- Precision:
- (Class-wise precision)
- Recall:
- (Class-wise recall)
- F1-score:
- (Class-wise F1-score)
- Random Forest:
- Accuracy: 56.3%
- Precision:
- (Class-wise precision)
- Recall:
- (Class-wise recall)
- F1-score:
- (Class-wise F1-score)
- Precision and Recall: We examine precision and recall metrics to assess the
model's ability to minimize false positives and false negatives, respectively.
4.102 Model Selection and Rationale:
- Based on the detailed results, both Logistic Regression and SVM demonstrate
competitive performance in terms of accuracy, precision, recall, and F1-score.
- By selecting SVM as the final model, we ensure that the text emotion detection
system meets the project's requirements for accuracy, interpretability, and
practicality.
- Performance metrics such as accuracy, precision, recall, and F1-score were
computed to assess the models' effectiveness in emotion detection.
- Precision, recall, and F1-score provide additional insights into the models'
ability to correctly identify specific emotions. The discussion should delve into
these metrics to understand the models' strengths and weaknesses in classifying
individual emotions.
- Visual representations such as confusion matrices, ROC curves, and precision-
recall curves were utilized to visualize the models' performance and aid in
interpretation.
- Feature importance plots for models like Random Forest highlight the
significance of different features in predicting emotions, offering insights into the
underlying mechanisms driving the models' decisions.
Model Comparison:
- While SVM demonstrated high accuracy, its computational complexity and lack
of interpretability may pose challenges in real-world applications.
- Random Forest, with its ensemble learning approach and feature importance
analysis, strikes a balance between accuracy and interpretability, making it a
promising choice for emotion detection tasks.
Summary of Findings:
- The study investigated various machine learning models for emotion detection
using textual data.
- Results indicate that SVM, Logistic Regression, and Random Forest are viable
options for emotion classification, each with its own strengths and trade-offs.
- SVM achieves high accuracy but may be computationally intensive and less
interpretable. Logistic Regression offers simplicity and interpretability, while
Random Forest balances accuracy and interpretability through ensemble learning
and feature importance analysis.
Implications and Future Directions:
- The findings from this study have implications for applications in sentiment analysis, social media monitoring, and customer feedback analysis.
- Future research could explore advanced deep learning models, such as recurrent
neural networks (RNNs) and transformers, to capture more nuanced patterns in
textual data and improve emotion detection performance.
- It's important to acknowledge the limitations of the study, including the size and
quality of the dataset, potential biases in labeling emotions, and generalization of
results to other domains or languages.
4.12 Deploying Text Emotion Detection on Streamlit
1. Project Preparation
Before deployment, ensure your text emotion detection model is trained and
saved in a format suitable for deployment (e.g., serialized using joblib). The
project structure should include the app script (app.py), the serialized model (model/text_emotion.pkl), and a requirements.txt file.

2. Create the Streamlit App
Streamlit is a Python library used to create interactive web apps for data science and machine learning projects. Follow these steps to create a Streamlit app:

Install Streamlit:

pip install streamlit
Create a Python script (app.py) to load the trained model and create the Streamlit user interface. Example app.py (a minimal sketch; the full version appears in the appendix):

import streamlit as st
import joblib

# Load the trained pipeline serialized with joblib
model_path = "model/text_emotion.pkl"
pipe_lr = joblib.load(model_path)

user_input = st.text_area("Enter text to analyze")
if st.button('Predict'):
    prediction = pipe_lr.predict([user_input])
    st.write("Detected emotion:", prediction[0])
3. Deployment on Streamlit Cloud
Streamlit Cloud allows you to deploy and share your Streamlit apps online.
Follow these steps to deploy your app:
3. Prepare Deployment:
4. Ensure your project directory contains all necessary files, including app.py,
model/text_emotion.pkl, and requirements.txt.
5. Create requirements.txt:
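A minimal requirements.txt for this app, inferred from the libraries imported in the appendix code (an assumption; pin versions as appropriate for your environment):

streamlit
pandas
numpy
scikit-learn
joblib
altair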
4.13 ALGORITHM
1. Text Preprocessing:
• Clean the input text (removing noise such as user handles, punctuation, and stopwords), then tokenize and normalize it as described in the methodology.
2. Feature Extraction:
• Extract features from the preprocessed text data to represent emotional content.
• Common feature extraction techniques include bag-of-words, TF-IDF (Term
Frequency-Inverse Document Frequency), and word embeddings (e.g.,
Word2Vec, GloVe) to capture semantic information.
3. Model Training:
• Train machine learning models on labeled data to predict the emotion category of
the input text.
• Utilize models such as Logistic Regression, Support Vector Machine (SVM), and
Random Forest for text classification tasks.
4. Model Evaluation:
5. Integration and Deployment:
• Integrate the validated models into a user-friendly interface for deployment, such
as a web application or API.
• Allow users to input text and receive emotion predictions in real-time.
6. Continuous Improvement:
APPENDIX-1:
SOURCE CODE:
import pandas as pd
import numpy as np
import gdown

# df = pd.read_csv(r'C:\Users\HP\OneDrive\Desktop\Text-Emotion-Detection-main\Text-Emotion-Detection-main\Text Emotion Detection\data\emotion_dataset_raw.csv')
file_id = '1Vz5__jh3LjgssVxFM3R71FbntqKZAlIi'
url = f'https://drive.google.com/uc?id={file_id}'
output_file = 'emotion_dataset_raw.csv'
gdown.download(url, output_file, quiet=False)  # download the dataset from Google Drive

df = pd.read_csv(output_file)
df.head()
df['Emotion'].value_counts()

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Emotion', data=df)
plt.show()
import neattext.functions as nfx
df['Clean_Text'] = df['Text'].apply(nfx.remove_userhandles)
df['Clean_Text'] = df['Clean_Text'].apply(nfx.remove_stopwords)
df
dir(nfx)
['BTC_ADDRESS_REGEX',
'CURRENCY_REGEX',
'CURRENCY_SYMB_REGEX',
'Counter',
'DATE_REGEX',
'EMAIL_REGEX',
'EMOJI_REGEX',
'HASTAG_REGEX',
'MASTERCard_REGEX',
'MD5_SHA_REGEX',
'MOST_COMMON_PUNCT_REGEX',
'NUMBERS_REGEX',
'PHONE_REGEX',
'PoBOX_REGEX',
'SPECIAL_CHARACTERS_REGEX',
'STOPWORDS',
'STOPWORDS_de',
'STOPWORDS_en',
'STOPWORDS_es',
'STOPWORDS_fr',
'STOPWORDS_ru',
'STOPWORDS_yo',
'STREET_ADDRESS_REGEX',
'TextFrame',
'URL_PATTERN',
'USER_HANDLES_REGEX',
'VISACard_REGEX',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__generate_text',
'__loader__',
'__name__',
'__numbers_dict',
'__package__',
'__spec__',
'_lex_richness_herdan',
'_lex_richness_maas_ttr',
'clean_text',
'defaultdict',
'digit2words',
'extract_btc_address',
'extract_currencies',
'extract_currency_symbols',
'extract_dates',
'extract_emails',
'extract_emojis',
'extract_hashtags',
'extract_html_tags',
'extract_mastercard_addr',
'extract_md5sha',
'extract_numbers',
'extract_pattern',
'extract_phone_numbers',
'extract_postoffice_box',
'extract_shortwords',
'extract_special_characters',
'extract_stopwords',
'extract_street_address',
'extract_terms_in_bracket',
'extract_urls',
'extract_userhandles',
'extract_visacard_addr',
'fix_contractions',
'generate_sentence',
'hamming_distance',
'inverse_df',
'lexical_richness',
'markov_chain',
'math',
'nlargest',
'normalize',
'num2words',
'random',
're',
'read_txt',
'remove_accents',
'remove_bad_quotes',
'remove_btc_address',
'remove_currencies',
'remove_currency_symbols',
'remove_custom_pattern',
'remove_custom_words',
'remove_dates',
'remove_emails',
'remove_emojis',
'remove_hashtags',
'remove_html_tags',
'remove_mastercard_addr',
'remove_md5sha',
'remove_multiple_spaces',
'remove_non_ascii',
'remove_numbers',
'remove_phone_numbers',
'remove_postoffice_box',
'remove_puncts',
'remove_punctuations',
'remove_shortwords',
'remove_special_characters',
'remove_stopwords',
'remove_street_address',
'remove_terms_in_bracket',
'remove_urls',
'remove_userhandles',
'remove_visacard_addr',
'replace_bad_quotes',
'replace_currencies',
'replace_currency_symbols',
'replace_dates',
'replace_emails',
'replace_emojis',
'replace_numbers',
'replace_phone_numbers',
'replace_special_characters',
'replace_term',
'replace_urls',
'string',
'term_freq',
'to_txt',
'unicodedata',
'word_freq',
'word_length_freq']
x = df['Clean_Text']
y = df['Emotion']

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

pipe_lr = Pipeline(steps=[('cv', CountVectorizer()), ('lr', LogisticRegression())])
pipe_lr.fit(x_train, y_train)
pipe_lr.score(x_test, y_test)

# pipe_svm was not defined in the extracted copy; a typical definition:
pipe_svm = Pipeline(steps=[('cv', CountVectorizer()), ('svc', SVC())])
pipe_svm.fit(x_train, y_train)
pipe_svm.score(x_test, y_test)

pipe_rf = Pipeline(steps=[('cv', CountVectorizer()), ('rf', RandomForestClassifier(n_estimators=10))])
pipe_rf.fit(x_train, y_train)
pipe_rf.score(x_test, y_test)
# rf_classifier is assumed to be a RandomForestClassifier trained on vectorized
# features (its definition was missing from the extracted copy), e.g.:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve
import matplotlib.pyplot as plt
import numpy as np

cv = CountVectorizer()
X_train = cv.fit_transform(x_train)
X_test = cv.transform(x_test)
rf_classifier = RandomForestClassifier(n_estimators=10)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
plt.imshow(cm, cmap='Blues')
plt.title('Confusion Matrix')
plt.colorbar()
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# ROC Curve (one class vs. rest, since the emotion labels are multiclass)
y_proba = rf_classifier.predict_proba(X_test)[:, 1]
positive_class = rf_classifier.classes_[1]
fpr, tpr, _ = roc_curve(y_test == positive_class, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# Precision-Recall Curve (same one-vs-rest setup)
precision, recall, _ = precision_recall_curve(y_test == positive_class, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

# Feature importances (top tokens by importance)
plt.figure(figsize=(10, 8))
feat_importances = rf_classifier.feature_importances_
top = np.argsort(feat_importances)[-20:]
plt.barh(np.array(cv.get_feature_names_out())[top], feat_importances[top])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()
import joblib

# Serialize the trained pipeline for deployment
with open("text_emotion.pkl", "wb") as pipeline_file:
    joblib.dump(pipe_lr, pipeline_file)
app.py

import streamlit as st
import pandas as pd
import numpy as np
import altair as alt
import joblib

model_path = r"C:/Users/HP/OneDrive/Desktop/Text-Emotion-Detection-main/Text-Emotion-Detection-main/Text Emotion Detection/model/text_emotion.pkl"
pipe_lr = joblib.load(model_path)

# The emoji glyphs were lost in the extracted copy; representative glyphs
# are restored here as placeholders.
emotions_emoji_dict = {"anger": "😠", "disgust": "🤮", "fear": "😨",
                       "happy": "🤗", "joy": "😂", "neutral": "😐", "sad": "😔",
                       "sadness": "😔", "shame": "😳", "surprise": "😮"}

def predict_emotions(docx):
    results = pipe_lr.predict([docx])
    return results[0]

def get_prediction_proba(docx):
    results = pipe_lr.predict_proba([docx])
    return results

def main():
    st.title("Text Emotion Detection")
    with st.form(key='my_form'):
        raw_text = st.text_area("Type your text here")
        submit_text = st.form_submit_button(label='Submit')

    if submit_text:
        col1, col2 = st.columns(2)
        prediction = predict_emotions(raw_text)
        probability = get_prediction_proba(raw_text)

        with col1:
            st.success("Original Text")
            st.write(raw_text)
            st.success("Prediction")
            emoji_icon = emotions_emoji_dict.get(prediction, "")
            st.write("{}:{}".format(prediction, emoji_icon))
            st.write("Confidence:{}".format(np.max(probability)))

        with col2:
            st.success("Prediction Probability")
            #st.write(probability)
            proba_df = pd.DataFrame(probability, columns=pipe_lr.classes_)
            #st.write(proba_df.T)
            proba_df_clean = proba_df.T.reset_index()
            proba_df_clean.columns = ["emotions", "probability"]
            fig = alt.Chart(proba_df_clean).mark_bar().encode(
                x='emotions', y='probability', color='emotions')
            st.altair_chart(fig, use_container_width=True)

if __name__ == '__main__':
    main()
CONCLUSION
Furthermore, the deployment of the user interface using Streamlit streamlines the
process of making the emotion detection system accessible over the web. By
hosting the application live, we ensure that users can interact with the system
seamlessly, without the need for local installation or setup. This deployment not
only enhances the accessibility of the system but also showcases the integration
of machine learning models into real-world applications.
Looking ahead, there are several avenues for future enhancement and exploration.
Refinements to the user interface, such as incorporating additional features or
visualizations, could enhance the user experience and provide deeper insights into
the emotion detection process. Moreover, continued research and development in
machine learning algorithms and techniques may lead to further improvements in
emotion classification accuracy and efficiency.
FUTURE ENHANCEMENTS
6. Scalability and Deployment: Optimize the system for scalability and deploy it
on cloud platforms to handle large-scale data processing and accommodate
growing user demand.
REFERENCES

[3] Z. F. Zhang, "Deep learning based methods research on scene text detection and recognition," Shenzhen: University of Chinese Academy of Sciences, 2020.

[5] Z. Tian, W. Huang, T. He, P. He and Y. Qiao, "Detecting text in natural image with connectionist text proposal network," in Proc. ECCV, 2016, pp. 56-72.

[9] P. Arya and S. Jain, "Text based emotion detection," Int. J. Comput. Eng. Technol., vol. 9, pp. 95-104, 2018.

[12] S. A. Salam and R. Gupta, "Emotion detection and recognition from text using machine learning," Int. J. Comput. Sci. Eng., vol. 6, pp. 341-345, 2018.

[18] X. Huang et al., "Emotion detection for conversations based on reinforcement learning framework," IEEE Multimed., vol. 28, pp. 76-85, 2021.

[24] D. Seal, U. K. Roy and R. Basak, "Sentence-level emotion detection from text based on semantic rules," in Advances in Intelligent Systems and Computing, vol. 933. Springer, Singapore, 2020.

[27] A. Deshpande and R. Paswan, "Real-time emotion recognition of twitter posts using a hybrid approach," ICTACT J. Soft Comput., 6956:2125-2133, 2020.