Sample Report

The document presents a minor project report on 'Email Classification Using RNN and Bidirectional LSTM layers' by students at SRM Institute of Science and Technology, supervised by Dr. Vani R. The project aims to develop an AI-driven system that classifies emails into 18 categories using advanced deep learning techniques, achieving a classification accuracy of 97%. The report includes an overview of the methodology, literature review, project description, and results, highlighting the advantages of using RNN and BiLSTM for effective email management.

Email Classification Using RNN and Bidirectional LSTM layers

A MINOR PROJECT REPORT

Submitted by
Samvardhan Singh - RA2111043020005
Pramod Gururajan - RA2111043020006
Anumita R. Ajit - RA2111043020008
Nikhil Shivanath - RA2111043020037

Under the guidance of


Dr. Vani R.

in partial fulfilment for the award of the degree


of
BACHELOR OF TECHNOLOGY
in

ELECTRONICS AND COMPUTER ENGINEERING


of

FACULTY OF ENGINEERING AND TECHNOLOGY

SRM INSTITUTE OF SCIENCE AND TECHNOLOGY


RAMAPURAM, CHENNAI -600089

NOV 2024
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Deemed to be University U/S 3 of UGC Act, 1956)

BONAFIDE CERTIFICATE

Certified that this minor project report titled “Email Classification
Using RNN and Bidirectional LSTM layers” is the bona fide work of
Samvardhan Singh [Reg No: RA2111043020005], Pramod
Gururajan [Reg No: RA2111043020006], Anumita R. Ajit [Reg No:
RA2111043020008], and Nikhil Shivanath [Reg No: RA2111043020037],
who carried out the minor project work under my supervision. Certified
further that, to the best of my knowledge, the work reported herein does
not form part of any other project report or dissertation on the basis of
which a degree or award was conferred on an earlier occasion on this or any
other candidate.

SIGNATURE
Dr. Vani R
Professor and Head
Supervisor
Electronics & Computer Engineering,
SRM Institute of Science and
Technology,
Ramapuram, Chennai.

Submitted for the minor project viva-voce held on ………….. at SRM
Institute of Science and Technology, Ramapuram, Chennai - 600089

INTERNAL EXAMINER 1 INTERNAL EXAMINER 2


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
RAMAPURAM, CHENNAI - 89

DECLARATION

We hereby declare that the entire work contained in this minor project report titled “Email
Classification Using RNN and Bidirectional LSTM layers” has been carried out by
Samvardhan Singh [Reg No: RA2111043020005], Pramod Gururajan [Reg No:
RA2111043020006], Anumita R. Ajit [Reg No: RA2111043020008], Nikhil Shivanath
[Reg No: RA2111043020037] at SRM Institute of Science and Technology, Ramapuram
Campus, Chennai- 600089, under the guidance of Dr. Vani R, Professor, Department of
Computer Science and Engineering.

Place: Chennai
Date: 26 Oct 2024

Samvardhan Singh
Pramod Gururajan
Anumita R. Ajit
Nikhil Shivanath

ABSTRACT

We have introduced an AI-driven approach for classifying emails into 18 distinct categories
using a Recurrent Neural Network (RNN) architecture with Bidirectional Long Short-Term
Memory (BiLSTM) layers, combined with GloVe pre-trained word embeddings. The model
efficiently categorizes emails across a wide range of types, including promotional, educational,
confirmational, personal, transactional, newsletters, feedback/surveys, alerts/notifications,
invitations, apologies/corrections, job applications/career-related, collaboration/partnership,
legal/compliance, internal communications, holiday greetings, reminders, unsubscribe, and
welcome emails. By leveraging contextual information from both forward and backward
directions, the model achieves a high classification accuracy of 97%, outperforming existing
methods. This makes it highly suitable for integration into email management systems, offering
a more streamlined and automated approach to email organization, reducing manual effort, and
improving productivity.

TABLE OF CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS AND ABBREVIATIONS

1 INTRODUCTION
    1.1 Introduction
    1.2 Problem Statement
    1.3 Objective of the Project
    1.4 Project Domain
    1.5 Scope of the Project
    1.6 Methodology

2 LITERATURE REVIEW

3 PROJECT DESCRIPTION
    3.1 Existing System
    3.2 Proposed System
    3.3 Feasibility Study
        3.3.1 Economic Feasibility
        3.3.2 Technical Feasibility
        3.3.3 Social Feasibility
    3.4 System Specification
        3.4.1 Hardware Specification
        3.4.2 Software Specification

4 PROPOSED WORK
    4.1 Introduction
    4.2 General Architecture
    4.3 Design Phase
        4.3.1 Data Flow Diagram
        4.3.2 UML Diagram
        4.3.3 Use Case Diagram
        4.3.4 Sequence Diagram
    4.4 Module Description
        4.4.1 Data Collection and Training Data
        4.4.2 Step:1 Data collecting
        4.4.3 Step:2 Processing of data
        4.4.4 Step:3 Split the Data
        4.4.5 Datasets Sample
        4.4.6 Step:4 Building the Model

5 IMPLEMENTATION AND TESTING
    5.1 Testing
    5.2 Implementation

6 RESULTS AND DISCUSSIONS
    6.1 Efficiency of the Proposed System

7 CONCLUSION AND FUTURE ENHANCEMENTS
    7.1 Conclusion
    7.2 Future Enhancements

8 SOURCE CODE
    8.1 Sample Code

References

LIST OF FIGURES

4.1 Distribution of Email Categories
4.2 Architecture Diagram
4.3 Data Flow Diagram
4.4 UML Diagram
4.5 Use Case Diagram
4.6 Sequence Diagram
4.7 Preprocessing of Data
4.8 Dataset of 1800 emails of 18 classes
5.1 Confusion Matrix
5.2 Model Training
5.3 Model Accuracy Curve
6.1 Model Loss

LIST OF TABLES

5.1 Precision, Recall, F1 score and support of each category
6.1 Comparison of Existing and Proposed System

LIST OF ACRONYMS AND ABBREVIATIONS

BiLSTM    BIDIRECTIONAL LONG SHORT-TERM MEMORY
RNN       RECURRENT NEURAL NETWORK
GloVe     GLOBAL VECTORS FOR WORD REPRESENTATION
NLP       NATURAL LANGUAGE PROCESSING
SVM       SUPPORT VECTOR MACHINES

Chapter 1

INTRODUCTION

1.1 Introduction

Email communication plays a significant role in both personal and
professional settings, but managing the sheer volume of emails has become
increasingly challenging. As the types of emails diversify, from promotional
offers to legal notifications, the need for automated solutions to efficiently
classify, filter, and prioritize messages has grown. Traditional rule-based
systems, while functional, often fail to account for the complexity and
variability of modern email content, resulting in misclassifications and
inefficiencies. In contrast, deep learning techniques, particularly Recurrent
Neural Networks (RNNs), provide a more dynamic and robust solution by
capturing the semantic and contextual nuances of email content, adapting to
evolving patterns and offering a higher degree of classification accuracy.
These advanced methods can significantly improve workflow efficiency,
especially in environments where timely and accurate email sorting is critical.

1.2 Problem Statement


The growing volume and variety of emails have made inbox management
increasingly difficult. Traditional rule-based systems often misclassify
emails because they cannot capture complex and evolving email content. This
leads to inefficiencies in email sorting, especially in environments where
timely and accurate classification is critical.
1.3 Objective of the Project

The objective is to develop an AI-driven email classification system that
leverages Recurrent Neural Networks (RNNs) with Bidirectional Long
Short-Term Memory (BiLSTM) layers and GloVe word embeddings to accurately
classify emails into 18 distinct categories, improving the efficiency and
accuracy of email management.

1.4 Project Domain

This project falls within the domains of Natural Language Processing (NLP)
and Machine Learning, focusing specifically on email classification. It aims
to enhance automated systems for organizing and prioritizing emails in both
personal and professional communication settings.

1.5 Scope of the Project

The project covers the development, training, and evaluation of a deep
learning model capable of classifying emails into 18 predefined categories. It
offers a solution for integration into email management systems, aiming to
streamline email organization, reduce manual effort, and improve productivity
in environments handling large volumes of diverse email content.

1.6 Methodology

The project employs a Recurrent Neural Network (RNN) architecture built on
Bidirectional Long Short-Term Memory (BiLSTM) layers to classify emails into
distinct categories. RNNs are well-suited to sequential data, and a BiLSTM
extends this by reading each sequence in both the forward and backward
directions, so the representation of every word draws on the words that
precede it and the words that follow it. This is especially useful for
complex email types where the meaning of a word depends on both its preceding
and following context.
The model uses pre-trained GloVe (Global Vectors for Word Representation)
embeddings to convert the words in the emails into numerical vectors, which
serve as input for the BiLSTM layers. GloVe embeddings are powerful
because they capture semantic relationships between words based on their co-
occurrence in large corpora, allowing the model to generalize better even for
unseen word combinations. Each word in the email is transformed into a
100-dimensional vector, which is then passed through two stacked BiLSTM
layers.
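
The embedding lookup described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper names `load_glove` and `build_embedding_matrix` are hypothetical, the toy vectors are made up, and a 3-dimensional stand-in replaces the 100-dimensional GloVe file. It assumes the usual GloVe text layout of one token followed by its vector components on each line.

```python
import io

def load_glove(handle, dim):
    """Parse GloVe-format text: one token followed by `dim` floats per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is kept for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]  # words without a vector stay all-zero
    return matrix

# Toy 3-dimensional stand-in for a real GloVe file (values are made up).
toy_file = io.StringIO("invoice 0.1 0.2 0.3\nmeeting 0.4 0.5 0.6\n")
glove = load_glove(toy_file, dim=3)
word_index = {"invoice": 1, "meeting": 2, "zzzunknown": 3}
embedding_matrix = build_embedding_matrix(word_index, glove, dim=3)
```

Words missing from the pre-trained vocabulary keep an all-zero row here; other choices, such as small random vectors, are equally plausible.
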
For training, the dataset is split into 80% for training and 20% for testing,
so that most of the data is available for learning while a held-out portion
measures how well the model generalizes. The Adam optimizer adjusts the
model's weights using running estimates of the first and second moments of
the gradients, which helps speed up convergence. The model's performance
is monitored using metrics like accuracy, precision, recall, and F1-score to
ensure robust classification across all 18 categories.
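
To make the role of the first and second moments concrete, the following sketch applies the standard Adam update rule to a single scalar parameter. The function name `adam_minimize`, the toy objective f(x) = (x - 3)^2, and the hyperparameter values are assumptions for illustration; the report does not state the learning rate it used.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first moment: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # second moment: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
```

In a deep learning framework the same update is applied to every weight independently, so each weight effectively receives its own step size.
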

Chapter 2

LITERATURE REVIEW

1. Amudha Poobalan et al. [1] developed a secured email
classification system using a deep neural network with BiLSTM and
hybrid encryption (AES-Rabbit) for enhanced cloud security,
outperforming other classifiers.
2. Alejandro De La Noval et al. [2] showed that large language
models like BERT improved spam classification accuracy,
demonstrating their effectiveness in handling complex email data.
3. Bandar Alshawi et al. [3] applied NLP techniques and BERT,
achieving high accuracy in spam detection, proving the capability of
advanced NLP models.
4. Congying Dai [4] compared deep learning algorithms, introducing
new methods that enhanced accuracy and spam detection.
5. Dr. K. Parthiban et al. [5,6] introduced a BiLSTM-based hybrid
deep neural network for email classification, including emotion
detection and phishing email handling with hybrid encryption for data
security.
6. Harshini B.V, Vadivazhagi S. [7] proposed a machine learning
system for automated email sorting, focusing on workflow efficiency.
7. Jiacheng Xu [8] evaluated machine learning models for spam
classification, suggesting future improvements in accuracy.
8. Marouane Zaizoune et al. [9] used supervised learning to classify
emails on user-defined categories, enhancing email management.
9. N. Saraswathi et al. [10] compared seven machine learning
approaches for spam classification, providing insights into classifier
performance.
10. Pallavi N, P. Jayarekha [11] achieved high accuracy in spam
classification using machine learning algorithms with feature
engineering.
11. Qianhe Ouyang et al. [12] compared KNN and Naive Bayes,
finding KNN superior in spam detection.
12. Shao Bo Cheng [13] highlighted Naive Bayes’ effectiveness in
spam clustering, comparing it with classifiers used by Google and
Yahoo.
13. Srinivasa Rao Dangeti et al. [14] enhanced spam classification
accuracy using machine learning and feed-forward neural networks.
14. Wolfgang Grundmann [15] studied SMS and email spam
classification, achieving high accuracy with NLP and machine
learning models.
15. Wahab A. Iddrisu [16] used Random Forest for content-based
academic email spam classification, achieving 94.2% accuracy.
16. Xiaoke Wang [17] found LSTMs superior to traditional methods
for spam classification.
17. Yi-Chu Huang [18] compared models like CNN, RNN, LSTM,
and Naive Bayes, concluding that deep learning models outperform
traditional ones.
18. Zhenghao Miao [19] compared six models for spam classification,
achieving 98.7% accuracy.
19. Zili Luo, Farhana Zulkernine [20] proposed a CNN-BiLSTM
model for multiclass email categorization, outperforming other
classifiers.
Chapter 3

PROJECT DESCRIPTION

3.1 Existing System

Traditional email classification methods primarily use rule-based
systems or machine learning models such as Naive Bayes and Support
Vector Machines (SVM). While these approaches work for simple
tasks like spam detection, they struggle with complex tasks involving
semantic and contextual understanding of email content. Rule-based
systems rely on static keywords, leading to inaccuracies, and these
classical machine learning models typically do not capture the
sequential nature of text, often failing to account for word
dependencies and long-term patterns in emails.

3.2 Proposed System

The proposed system uses a Recurrent Neural Network (RNN)
with Bidirectional Long Short-Term Memory (BiLSTM) layers to
improve email classification over traditional rule-based and
classical machine learning systems. GloVe word embeddings transform
words into dense vector representations that capture semantic
relationships based on co-occurrence in a large corpus. The BiLSTM
layers process email content in both forward and backward directions,
so the model captures contextual meaning from surrounding words, which
is crucial for accurately classifying complex and context-sensitive
emails. This architecture enables the model to classify emails into 18
distinct categories, including promotional, legal, and transactional.
Dropout, batch normalization, and early stopping are used to limit
overfitting, and the model achieves a classification accuracy of 97%,
which makes the system suitable for real-world email classification
tasks.
Advantages

• Improves classification accuracy by capturing email sequence and context.


• Handles complex emails using BiLSTM layers and GloVe embeddings.

• Achieves accurate classification across various email types.

• Efficient, scalable, and adaptable to changing email formats.

3.3 Feasibility Study

A feasibility study is carried out to check the viability of the
project and to analyze the strengths and weaknesses of the proposed
system. The feasibility study is carried out in three forms:

• Economic Feasibility

• Technical Feasibility

• Social Feasibility

3.3.1 Economic Feasibility

The system is economically feasible because it uses pre-trained
GloVe embeddings, which reduce the need for extensive
computational resources during the training phase. While the initial
setup may require hardware with moderate processing capabilities,
the cost is justified by the increased productivity and reduced manual
effort in email management. The system's adaptability means it is
expected to remain effective with minimal retraining, making it
cost-efficient in the long term.

3.3.2 Technical Feasibility

The project is technically feasible as it builds upon well-established
deep learning techniques such as BiLSTM and RNN, which have
been widely applied in natural language processing tasks. The use of
GloVe embeddings enhances the model's ability to understand word
semantics, while dropout and batch normalization techniques ensure
robustness. The system can be integrated into existing email
platforms with minimal technical challenges, given the availability of
libraries like TensorFlow or PyTorch for implementation.

3.3.3 Social Feasibility

The automated email classification system addresses a widespread
issue of managing high email volumes in both personal and
professional settings. By reducing manual effort and improving
productivity, the system has a positive social impact. It can be applied
in various industries where timely and accurate email sorting is
crucial, including legal, education, and customer service sectors.
Moreover, by improving workflow efficiency, the system can reduce
employee stress related to email overload.

3.4 System Specification

3.4.1 Hardware Specification

• The hardware requirements for the project include a system with a
multi-core processor, at least 16GB of RAM, and a GPU for
efficient training of the BiLSTM model. A storage capacity of
500GB is recommended to handle large email datasets and
pre-trained embeddings. The system should also support a compatible
deep learning framework, such as TensorFlow or PyTorch, for
model training and evaluation.

3.4.2 Software Specification

• The software requirements include Python as the primary
programming language, along with libraries such as TensorFlow or
PyTorch for deep learning, NumPy and Pandas for data
manipulation, and GloVe for pre-trained word embeddings. Jupyter
Notebook or any IDE can be used for code development and testing.
The system should also have access to an email dataset, and libraries
such as Scikit-learn can be used for data preprocessing and
evaluation.

Chapter 4

PROPOSED WORK

4.1 Introduction

The study "Email Classification Using RNN and Bidirectional LSTM layers"
describes an AI-based technique for email classification that employs a
Recurrent Neural Network (RNN) architecture with Bidirectional Long
Short-Term Memory (BiLSTM) layers, supplemented with GloVe pre-trained word
embeddings. The proposed model classifies emails into 18 different
categories, ranging from promotional and transactional emails to legal notices
and internal communications, with an accuracy of 97%. By analyzing text
sequences in both forward and backward directions, the BiLSTM model
improves its understanding of contextual and semantic subtleties in the text,
which is required for correctly classifying the various kinds of email.

Traditional email categorization systems have relied on rule-based approaches
and simpler machine learning techniques, such as Naive Bayes and Support
Vector Machines (SVM), but these models frequently fail to generalize
effectively across complex and evolving email formats. Deep learning
models such as RNNs, particularly BiLSTMs, address these issues by
capturing long-term dependencies and sequence patterns in text,
which is critical for understanding context within emails. GloVe embeddings
improve the model by providing rich, pre-trained vector representations of
words, allowing it to grasp word semantics in a larger context, including
previously unseen word combinations.

The model architecture comprises an embedding layer, BiLSTM layers, and
dense layers. To prevent overfitting, it applies regularization techniques
such as dropout and batch normalization. It also uses an adaptive optimizer
(Adam) and categorical cross-entropy loss for multi-class classification. This
technique represents a substantial advance in automated email
categorization, decreasing the manual work required for email sorting and
increasing productivity by handling complex email types in
both personal and commercial settings. Figure 4.1 shows the distribution
of email categories within the dataset.

Figure 4.1: Distribution of Email Categories

4.2 General Architecture

Figure 4.2: Architecture Diagram

Figure 4.2 represents the detailed architecture of the RNN model with
BiLSTM layers for email classification. The flowchart defines the following:

1. User Input → Data Preprocessing:
a. Converting text to lowercase
b. Removing HTML tags
c. Removing URLs
d. Removing special characters
e. Removing stopwords
2. Preprocessing includes:
a. Padding: Standardizing input length
b. Feature Extraction
c. Tokenization: Converting preprocessed text into a sequence of integers
3. Embedding Layer:
a. Uses GloVe pre-trained embeddings (100-dimensional vectors)
b. Converts tokenized text into word embeddings
4. BiLSTM Layers:
a. BiLSTM Layer 1: 128 units
b. BiLSTM Layer 2: 64 units
c. Processes text in both forward and backward directions
5. Model Components:
a. Dropout layer: For preventing overfitting (30% dropout rate)
b. Batch Normalization: For stabilizing training
c. Regularization: Additional measure against overfitting
6. Dense Layers:
a. Dense Layer: 64 units with ReLU activation
b. Output Layer: 18 units (corresponding to email categories) with
softmax activation
7. Final Steps:
a. Load Trained Model
b. Predict Email Category
c. Display Results
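
As a rough size check on the layer widths listed above, the following sketch counts trainable weights. It is an estimate under stated assumptions rather than a figure from the project: it uses the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias), assumes the two directions are concatenated, treats the GloVe embedding as frozen, and ignores batch normalization parameters. The helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def bilstm_params(input_dim, units):
    # One LSTM per direction.
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

EMBED_DIM = 100  # GloVe vector size from the architecture description

layer_params = {
    "bilstm_1": bilstm_params(EMBED_DIM, 128),  # output width 2 * 128 = 256
    "bilstm_2": bilstm_params(2 * 128, 64),     # output width 2 * 64 = 128
    "dense": dense_params(2 * 64, 64),
    "output": dense_params(64, 18),
}
total_trainable = sum(layer_params.values())
```

Under these assumptions the two BiLSTM layers account for almost all of the trainable weights, which is worth keeping in mind given that the dataset holds 1800 emails.
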

4.3 Design Phase
4.3.1 Data Flow Diagram

Figure 4.3: Data Flow Diagram

Figure 4.3 shows the process flow of the email classification system. It
illustrates:
1. The process starts with "User Input" at the top
2. Branches into two parallel paths:
a. Left path: Goes through Raw E-mail text → Preprocessing →
Make Plan Adjustments → Monitor Project Progress
b. Right path: Goes through Feature Extraction → TF-IDF Calculation →
Make Plan Adjustments → Monitor Project Progress
3. Both paths converge into the "Classification model"
4. Finally leads to the "Predicted E-mail category" output (labeled
"Predicated" in the figure)

4.3.2 UML Diagram

Figure 4.4: UML Diagram

Figure 4.4 shows the class diagram of the email classification system
architecture with two main components:

1. E-MAIL CLASSIFIER class:
a. Attributes: Preprocessor, Embedding Layer, bilstm_model,
dense_layer
b. Methods: preprocess_text(), tokenize(), pad_sequence(),
embed_text(), classify_email()
2. BI-LSTM MODEL class:
a. Attributes:
i. lstm_units1: int
ii. lstm_units2: int
iii. dropout_rate: float
b. Methods: forward(), train(), predict()

4.3.3 Use Case Diagram

Figure 4.5: Use Case Diagram

Figure 4.5 illustrates the overall flow of the email classification system,
where the user inputs an email that undergoes preprocessing to clean and
prepare the text. The system then classifies the email using the RNN/BiLSTM
model, which can also be updated through training with new data to improve
performance. The user can view the classified category results and interact
with the model as needed.

4.3.4 Sequence Diagram

Figure 4.6: Sequence Diagram

Figure 4.6 represents a step-by-step sequence, starting with the user's email
input. The email is cleaned, tokenized, padded for uniform length, and then
transformed using word embeddings. It is processed through the Bi-LSTM
model, and the output is passed through a dense layer and a softmax layer for
classification. The predicted email category is finally returned to the user.
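
The tokenize-and-pad steps in this sequence can be sketched in plain Python as follows. The helper names (`fit_vocabulary`, `texts_to_sequences`, `pad_right`) and the maximum length of 5 are hypothetical and chosen only for illustration; index 0 is reserved for padding, and padding on the right is an assumption, since the report does not say which side is padded.

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each word to an integer index, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_right(sequences, max_len):
    """Truncate each sequence to max_len, then fill the remainder with zeros."""
    return [seq[:max_len] + [0] * (max_len - len(seq[:max_len])) for seq in sequences]

emails = ["meeting moved to friday", "invoice attached"]
word_index = fit_vocabulary(emails)
padded = pad_right(texts_to_sequences(emails, word_index), max_len=5)
# padded == [[4, 5, 6, 2, 0], [3, 1, 0, 0, 0]]
```

Each row of `padded` has the same length, which is what allows a batch of emails to be fed to the embedding layer as one rectangular array.
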

4.4 Module Description

Our entire project is divided into two modules.


4.4.1 Data Collection and Training Data

Data Collection and training using Machine Learning Algorithms


4.4.2 Step:1 Data collecting

• Identify Data Sources: Collect emails from reliable sources, like publicly
available datasets (e.g., Enron email dataset, spam classification datasets)
or organizational email systems with permission.
• Category Selection: Define clear email categories (e.g., promotional,
transactional, personal) to guide data labeling.
• Data Labeling: Use manual or automated labeling methods to assign
emails to categories, ensuring high-quality labeled data.
• Data Diversity: Ensure the dataset represents various email types, topics,
and formats to improve the model’s robustness and generalization.

4.4.3 Step:2 Processing of data

Figure 4.7 represents a step-by-step process of preprocessing the dataset
before model training.
• Data Cleaning: Remove unwanted characters, HTML tags, hyperlinks,
and non-essential metadata to standardize text.
• Normalization: Convert text to lowercase, and apply lemmatization or
stemming to unify word forms.
• Tokenization: Split the text into individual tokens (words or phrases) for
easier processing by the model.
• Remove Stop Words: Remove common stop words (like "the," "and") to
reduce noise in the data.
• Word Embedding/Vectorization: Convert tokens to numerical form
using pre-trained embeddings (e.g., GloVe, Word2Vec) or encoding
methods to represent semantic relationships.
• Train-Validation-Test Split: Divide the processed data into training,
validation, and test sets to prevent overfitting and assess model
performance.
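
The cleaning, normalization, and stop-word steps above can be sketched with the standard library alone. This is a simplified illustration under assumptions: the function name `clean_email` is hypothetical, the stop-word set is a small illustrative subset rather than a full list, and lemmatization or stemming is omitted.

```python
import re

STOPWORDS = {"the", "and", "a", "an", "to", "of", "is", "for"}  # illustrative subset

def clean_email(text):
    text = text.lower()                                # normalization
    text = re.sub(r"<[^>]+>", " ", text)               # strip HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)              # strip digits, punctuation, symbols
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_email("<p>Claim the 50% OFF coupon at https://example.com/deal and save!</p>")
# cleaned == "claim off coupon at save"
```

The order of the substitutions matters: tags are removed before URLs so that a closing tag directly after a link is not swallowed into the URL pattern.
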

Figure 4.7: Preprocessing of Data

4.4.4 Step:3 Split the Data

After the pre-processing part, the dataset was split into 80% for training and 20%
for testing. This split helps to train the model on the majority of the data while
reserving a portion for evaluating its performance on unseen samples, ensuring
better generalization. Figure 4.8 shows a sample of the split dataset.
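
A minimal sketch of such an 80/20 split is shown below. The function name `split_train_test`, the seed, and the placeholder records are assumptions; in particular, the evenly balanced labels are only a stand-in for the real dataset.

```python
import random

def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle indices reproducibly, then hold out the first test_fraction of them."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test

# Placeholder records: 1800 (text, label) pairs over 18 class labels.
dataset = [(f"email {i}", i % 18) for i in range(1800)]
train_set, test_set = split_train_test(dataset)
# 1440 training emails and 360 test emails
```

This split is purely random, so per-class counts in the test set will vary; a stratified split would keep the class proportions equal in both parts.
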

4.4.5 Datasets Sample

Figure 4.8: Dataset of 1800 emails of 18 classes


4.4.6 Step:4 Building the Model

• Data Preprocessing: Clean the email text data (remove special characters,
convert to lowercase, etc.), then tokenize and vectorize it using pre-trained
embeddings like GloVe.
• Model Architecture: Choose a recurrent neural network (RNN)
architecture with BiLSTM layers to handle the text's sequential nature. Add
an embedding layer, followed by stacked BiLSTM layers with dropout and
batch normalization to prevent overfitting.
• Output Layer: Set a dense layer with a softmax activation to classify into
multiple categories.
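
To illustrate what the softmax output layer computes, and how the categorical cross-entropy loss used with it scores a prediction, here is a small standard-library sketch. The logit values are made up, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])

# Made-up scores for 18 categories, with class 4 clearly preferred.
logits = [0.0] * 18
logits[4] = 5.0
probs = softmax(logits)
predicted_class = probs.index(max(probs))
loss_if_correct = categorical_cross_entropy(probs, true_index=4)
loss_if_wrong = categorical_cross_entropy(probs, true_index=0)
```

The loss is small when the model puts most of its probability on the true class and grows quickly when it does not. A model that spreads probability evenly over all 18 classes has a loss of ln(18), roughly 2.89.
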

Chapter 5

TESTING AND IMPLEMENTATION

5.1 Testing

Testing is the process of evaluating a system or its component(s) with the intent
to find whether it satisfies the specified requirements or not.
1. Performance Metrics: To comprehensively evaluate the model’s classification
accuracy, a combination of performance metrics is used, such as:
• Accuracy: Measures the proportion of total correct predictions, providing an
overall sense of model accuracy.
• Precision: Evaluates the ratio of true positive predictions to the total predicted
positives, ensuring the model’s relevance when identifying specific classes.
• Recall: Determines the proportion of true positives identified from the actual
positives, assessing the model's effectiveness in capturing all relevant
instances.
• F1-Score: Balances precision and recall, providing a harmonic mean that
highlights the model’s ability to maintain accuracy across diverse categories. A
high F1-score indicates robust, well-rounded classification performance,
particularly in scenarios with imbalanced datasets.
2. Confusion Matrix Analysis: Check the confusion matrix to understand
category-wise classification accuracy.
3. Fine-Tuning: Adjust and retrain the model if certain categories show low
performance to improve accuracy and generalization.
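
The per-class metrics defined above reduce to three counts: true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses hypothetical counts for a class with 18 test emails; they were chosen to be consistent with rounded values such as 0.94, 0.83, and 0.88, and are not read from the project's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute per-class precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts for one class with 18 test emails:
# 15 correctly identified (TP), 3 missed (FN),
# and 1 email from another class wrongly assigned to this one (FP).
p, r, f1 = precision_recall_f1(tp=15, fp=1, fn=3)
rounded = (round(p, 2), round(r, 2), round(f1, 2))
# rounded == (0.94, 0.83, 0.88)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two and sits closer to the lower one.
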

Figure 5.1: Confusion Matrix

For training, the dataset is divided with 80% used for training and 20% for
testing. The Adam optimizer, a commonly used variant of stochastic gradient
descent, is employed as it adapts the learning rate based on the first and second
moments of the gradients. The loss function used is categorical cross-entropy,
which suits multi-class classification tasks like this one, where more than two
categories exist. Early stopping is applied during training to halt the process
if the validation loss does not improve for three consecutive epochs,
preventing overfitting and ensuring the model does not learn noise from the
data. The model is trained for up to 15 epochs with a batch size of 32, where
each epoch represents a full pass through the training data, and the batch size
refers to the number of samples processed before the model updates its
weights. Figure 5.1 shows the confusion matrix, which gives the class-wise
accuracy of the model.
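
The early-stopping rule and the batch arithmetic described above can be sketched as follows. The validation losses are made up for illustration, and the function name `early_stopping_epoch` is hypothetical; the step count assumes that all 1440 training emails (80% of 1800) are used for weight updates.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training stops."""
    best = float("inf")
    stale = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses for a 15-epoch budget.
losses = [1.90, 1.10, 0.62, 0.41, 0.30, 0.27, 0.29, 0.28,
          0.31, 0.33, 0.35, 0.36, 0.37, 0.38, 0.39]
stopped_at = early_stopping_epoch(losses)  # best loss at epoch 6, stop at epoch 9

# Weight updates per epoch with a batch size of 32: ceil(1440 / 32).
steps_per_epoch = -(-1440 // 32)
```

With these example losses, training ends after epoch 9 instead of running the full 15 epochs, and each epoch performs 45 weight updates.
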

Figure 5.2: Model Training

Figures 5.2 and 5.3 display the training and validation curves for the model's
accuracy and loss. Training stops once the validation loss has shown no
improvement for three consecutive epochs, which is enforced through early
stopping. The F1-score provides a balanced measure of performance by taking the
harmonic mean of precision and recall. Recall indicates the proportion of
actual positive examples that the model correctly identified.

Figure 5.3: Model Accuracy Curve

Table 5.1. Precision, Recall, F1 score and support of each category

Category              Precision   Recall   F1-Score   Support

1. promotions         1.00        1.00     1.00       24
2. educational        0.95        1.00     0.98       20
3. confirmational     0.92        0.92     0.92       12
4. personal           0.96        1.00     0.98       22
5. transactional      1.00        1.00     1.00       22
6. newsletter         1.00        1.00     1.00       17
7. feedback           1.00        1.00     1.00       21
8. alert              1.00        1.00     1.00       19
9. invitation         0.96        0.96     0.96       25
10. apology           0.95        1.00     0.98       21
11. career            1.00        1.00     1.00       27
12. collaboration     1.00        1.00     1.00       15
13. legal             1.00        1.00     1.00       22
14. internal          0.95        0.95     0.95       19
15. holiday           0.90        0.95     0.95       19
16. reminder          0.90        0.95     0.92       15
17. unsubscribe       0.94        0.83     0.88       18
18. welcome           1.00        1.00     1.00       22
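The per-category figures in Table 5.1 can be combined into overall summary numbers. In single-label classification, the support-weighted average of recall equals overall accuracy, so the sketch below estimates it from the table. The values are copied from Table 5.1 as printed and are rounded to two decimals, so the result is only an approximation; the variable names are illustrative.

```python
# (precision, recall, f1, support) per category, copied from Table 5.1.
table_5_1 = {
    "promotions": (1.00, 1.00, 1.00, 24),
    "educational": (0.95, 1.00, 0.98, 20),
    "confirmational": (0.92, 0.92, 0.92, 12),
    "personal": (0.96, 1.00, 0.98, 22),
    "transactional": (1.00, 1.00, 1.00, 22),
    "newsletter": (1.00, 1.00, 1.00, 17),
    "feedback": (1.00, 1.00, 1.00, 21),
    "alert": (1.00, 1.00, 1.00, 19),
    "invitation": (0.96, 0.96, 0.96, 25),
    "apology": (0.95, 1.00, 0.98, 21),
    "career": (1.00, 1.00, 1.00, 27),
    "collaboration": (1.00, 1.00, 1.00, 15),
    "legal": (1.00, 1.00, 1.00, 22),
    "internal": (0.95, 0.95, 0.95, 19),
    "holiday": (0.90, 0.95, 0.95, 19),
    "reminder": (0.90, 0.95, 0.92, 15),
    "unsubscribe": (0.94, 0.83, 0.88, 18),
    "welcome": (1.00, 1.00, 1.00, 22),
}

total_support = sum(row[3] for row in table_5_1.values())

# Support-weighted recall: an estimate of overall test accuracy.
weighted_recall = sum(r * s for _, r, _, s in table_5_1.values()) / total_support

# Macro-averaged F1: every category counts equally, regardless of size.
macro_f1 = sum(row[2] for row in table_5_1.values()) / len(table_5_1)

print(f"test samples: {total_support}")
print(f"support-weighted recall (accuracy estimate): {weighted_recall:.3f}")
print(f"macro-averaged F1: {macro_f1:.3f}")
```

The support column sums to 360 samples, and the weighted recall should land close to the reported overall accuracy, within the rounding error of the table.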

In summary, the proposed model offers a robust solution for classifying
emails into multiple categories using an RNN architecture with BiLSTM
layers and GloVe embeddings. By using techniques such as dropout, batch
normalization, and early stopping, the model generalizes effectively to
unseen data and achieves high classification accuracy across a variety of
email formats.

Implementing the model.


• Training Configuration: Split the dataset into training and testing sets
(typically 80-20). Use categorical cross-entropy as the loss function and an
optimizer like Adam. Train for a set number of epochs with early stopping
to avoid overfitting.
• Model Training: Run the training process, monitor training and validation
loss/accuracy, and adjust hyperparameters if necessary.
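The training configuration above names categorical cross-entropy, while the listing in Chapter 8 passes `sparse_categorical_crossentropy` with integer labels. The two compute the same loss and differ only in how the target is encoded, as this framework-free sketch illustrates. The function names and the three-class probability vector are assumptions made for the example.

```python
import math


def categorical_cross_entropy(one_hot, probs):
    """Loss for a one-hot target vector and predicted class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


def sparse_categorical_cross_entropy(label, probs):
    """Same loss, with the target given as an integer class index."""
    return -math.log(probs[label])


# Hypothetical softmax output for a 3-class problem; the true class is 0.
probs = [0.7, 0.2, 0.1]
loss_one_hot = categorical_cross_entropy([1, 0, 0], probs)
loss_sparse = sparse_categorical_cross_entropy(0, probs)

print(round(loss_one_hot, 4), round(loss_sparse, 4))  # 0.3567 0.3567
```

The sparse form avoids building an 18-column one-hot matrix for the labels, which is presumably why the listing uses it.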
Chapter 6
RESULTS AND DISCUSSION

6.1 Efficiency of the Proposed System


The proposed model was trained using a dataset of 1800 emails divided into
18 separate categories. After preprocessing the data and dividing it into
training and testing sets (80% training and 20% testing), the model was trained
for a maximum of 15 epochs with a batch size of 32, using the Adam optimizer
with categorical cross-entropy as the loss function. Early stopping was used
to prevent overfitting by ending the training process after three
consecutive epochs with no significant improvement in validation loss.
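The figures above fix how many weight updates the model can receive. The arithmetic is sketched below under the assumption that all 1800 emails survive preprocessing and that no separate validation subset is carved out of the training split; the variable names are illustrative.

```python
import math

total_emails = 1800
test_fraction = 0.2
batch_size = 32
max_epochs = 15

test_size = round(total_emails * test_fraction)       # emails held out for testing
train_size = total_emails - test_size                 # emails used for training
steps_per_epoch = math.ceil(train_size / batch_size)  # weight updates per epoch
max_updates = steps_per_epoch * max_epochs            # upper bound without early stopping

print(train_size, test_size, steps_per_epoch, max_updates)  # 1440 360 45 675
```

Early stopping can only lower the total, so 675 updates is an upper bound for the 15-epoch setting.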

Figure 6.1: Model Loss

The model's performance was evaluated using measures such as accuracy,
precision, recall, and F1-score. The model obtained 97% accuracy on the test
set, demonstrating its ability to categorize emails across all 18
categories. The high accuracy suggests that the model learned to
distinguish between distinct types of emails, such as promotional,
transactional, personal, and legal/compliance emails. Figure 6.1 shows the
loss curves for the model during training and validation.
Table 6.1. Comparison of Existing and Proposed System

Model               Algorithm                     Accuracy (%)

Naive Bayes         Probability Classifier        87.5
SVM                 Support Vector Machine        90.1
CNN                 Convolutional Neural Net      92.7
LSTM                Long Short-Term Memory        95.0
BiLSTM (proposed)   Bidirectional LSTM + GloVe    97.0

Table 6.1 compares the algorithms and models used in the previously
referenced research papers with the proposed model. In conclusion, the
results demonstrate the effectiveness of combining Bidirectional LSTM
layers with pre-trained GloVe embeddings for email classification. The
model not only outperformed traditional methods like Naive Bayes and SVM
but also showed substantial improvements in handling complex and
context-dependent email categories. While there are areas for
improvement, particularly in categories with overlapping content, the
overall performance of the model highlights its potential for real-world
applications in automating email categorization in business and personal
contexts.

Chapter 7
CONCLUSION AND FUTURE ENHANCEMENTS
7.1 Conclusion
The proposed model, which uses a Recurrent Neural Network (RNN)
with Bidirectional Long Short-Term Memory (BiLSTM) layers and GloVe
embeddings, achieved high accuracy in categorizing emails into 18
separate categories while handling both simple and complex email types.
With the BiLSTM architecture, the model gathers contextual information
from both the preceding and the following terms in the text, which improves
its ability to distinguish meaning and context, particularly in nuanced
categories such as legal/compliance and collaborative communications. With a
classification accuracy of 97%, the model outperformed standard approaches
like Naive Bayes and SVM, indicating its potential for real-world use in
automated email sorting and management. This robustness makes the model
well suited to contexts where accurate email classification is critical for
workflow management.
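The idea of reading context from both directions can be illustrated with a deliberately simplified recurrence. The sketch below is not an LSTM: it uses a toy update rule (state = decay * previous state + input) chosen only to show how a forward pass and a backward pass over the same sequence are paired at each position. All names and values are hypothetical.

```python
def run_direction(values, decay=0.5):
    """Toy recurrence: each state summarizes the inputs seen so far."""
    state = 0.0
    states = []
    for x in values:
        state = decay * state + x
        states.append(state)
    return states


def bidirectional_states(values, decay=0.5):
    """Pair a left-to-right state with a right-to-left state at every position."""
    forward = run_direction(values, decay)
    backward = run_direction(values[::-1], decay)[::-1]
    return list(zip(forward, backward))


# Hypothetical one-dimensional "token" inputs.
tokens = [1.0, 0.0, 0.0, 2.0]
states = bidirectional_states(tokens)
for position, (fwd, bwd) in enumerate(states):
    print(position, fwd, bwd)
```

At position 1 the forward state still carries a trace of the first token while the backward state carries a trace of the last one, which is the property a bidirectional layer exploits.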

7.2 Future Enhancements

Although the present model provides good performance, there are areas for
future enhancement to increase its versatility and accuracy:
1) Expansion of Email Categories: Adding more detailed subcategories
within the current classes, such as dividing promotional emails into several
types (e.g., seasonal promotions, tailored offers), would give users
more specialized email classification.
2) Handling Mixed-Content Emails: Improving the model's ability to
recognize and categorize emails with several content types (for example, a
promotional email with legal disclaimers) will help it handle complex
email structures better.
3) Sentiment Analysis: By incorporating sentiment analysis layers, the model
could provide emotional context, such as urgency or intent, which would be
useful in prioritizing emails.
4) Improving Generalization with Transfer Learning: Exploring pre-trained
large language models or applying transfer learning techniques may make the
model more adaptable to new or unseen email categories with minimal
retraining.
5) Real-Time Processing Optimization: Additional optimizations, such as
reducing computational complexity or using cloud-based processing,
would enable smooth, real-time email categorization, which is especially
useful in business settings with large email volumes.

Chapter 8
SOURCE CODE
8.1 Sample Code
import tensorflow as tf
print(tf.__version__)

# !pip install keras  # notebook-only shell command; in a terminal run: pip install keras

import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint

nltk.download('stopwords')

# Load the dataset


df = pd.read_csv("Dataset.csv", header=None)

# Renaming columns
df.columns = ['email_type', 'content']

# Check for invalid label values
invalid_labels = df[~df['email_type'].isin(range(1, 19))]
if not invalid_labels.empty:
    print("Invalid label values found:")
    print(invalid_labels)

# Remove rows with invalid label values


df = df[df['email_type'].isin(range(1, 19))]

# Data preprocessing
STOP_WORDS = set(stopwords.words('english'))  # build the stopword set once

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep letters and whitespace only (digits are removed too)
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text

df['content'] = df['content'].apply(preprocess_text)

# Tokenization and Padding


tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['content'])
X_seq = tokenizer.texts_to_sequences(df['content'])
max_length = 500
X_pad = pad_sequences(X_seq, maxlen=max_length, padding='post')

# Update labels
y = df['email_type'].values - 1 # Convert labels to 0-based indices

# Split dataset into train and test sets


X_train, X_test, y_train, y_test = train_test_split(X_pad, y, test_size=0.2, random_state=42)

# Define the model


embedding_dim = 300
vocab_size = len(tokenizer.word_index) + 1

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.5))
model.add(Dense(18, activation='softmax'))

# Compile the model


optimizer = Adam(learning_rate=0.001)
model.compile(optimizer=optimizer,loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model


early_stop = EarlyStopping(monitor='val_loss', patience=3)
checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True)
history = model.fit(X_train, y_train, epochs=45, batch_size=32,
                    validation_data=(X_test, y_test),
                    callbacks=[early_stop, checkpoint])

# Evaluate the model


loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")

# Save tokenizer and model


model.save("email_classification_model.h5")
tokenizer_json = tokenizer.to_json()
with open("tokenizer.json", "w") as json_file:
    json_file.write(tokenizer_json)
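The listing above calls `pad_sequences` with `padding='post'` and leaves `truncating` at its Keras default of `'pre'`. The pure-Python sketch below mimics that combination for a single sequence, so the effect on short and long emails is easy to see. The helper name `pad_post_truncate_pre` and the toy sequences are illustrative, and `maxlen` is assumed to be positive.

```python
def pad_post_truncate_pre(sequence, maxlen, pad_value=0):
    """Mimic padding='post' with truncating='pre' for one token-id sequence."""
    if len(sequence) > maxlen:
        # 'pre' truncation drops tokens from the start and keeps the last maxlen.
        return sequence[-maxlen:]
    # 'post' padding appends pad_value until the sequence reaches maxlen.
    return sequence + [pad_value] * (maxlen - len(sequence))


short_email = [5, 8, 2]
long_email = [1, 2, 3, 4, 5, 6, 7]

print(pad_post_truncate_pre(short_email, maxlen=5))  # [5, 8, 2, 0, 0]
print(pad_post_truncate_pre(long_email, maxlen=5))   # [3, 4, 5, 6, 7]
```

With `max_length = 500`, an email longer than 500 tokens therefore keeps its final 500 tokens rather than its opening ones.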

References
Material Type Works Cited

Journal article [1] Poobalan, A., et al. "A novel and secured email classification
using deep neural network with bidirectional long short-term
memory." Computer Speech & Language, vol. 89, 2025, p.
101667.
[5] Krishnamoorthy, Parthiban, et al. "A novel and secured
email classification and emotion detection using hybrid deep
neural network." International Journal of Cognitive Computing in
Engineering, vol. 5, 2024, pp. 44-57.
[7] Harshini, B. V. "Enhancing Workflow Efficiency through
Machine Learning-Based Email Sorting." Journal of Soft
Computing and Computational Intelligence, vol. 1, no. 2, 2024,
pp. 13-24.
[3] Alshawi, Bandar, et al. "Classification of SPAM mail utilizing
machine learning and deep learning techniques." International
Journal on Information Technologies and Security, vol. 16, no.
2, 2024, pp. 71-82.
[19] Douzi, Samira, et al. "Hybrid email spam detection model
using artificial intelligence." International Journal of Machine
Learning and Computing, vol. 10, no. 2, 2020.
eJournal (from internet) [20] Miao, Zhenghao. "Efficient Spam Classification Using
Machine Learning Methods." Highlights in Science, Engineering
and Technology, vol. 34, 2023, pp. 60-64.

Conference paper [2] De La Noval, Alejandro, et al. "Methodologies for Email
Spam Classification Using Large Language Models." In 2023
International Conference on Computational Science and
Computational Intelligence (CSCI), pp. 179-185. IEEE, 2023.
[6] Savitha, G., et al. "Advanced Email Spam Detection: A
Machine Learning Solution." In 2023 International Conference
on Evolutionary Algorithms and Soft Computing Techniques
(EASCT), pp. 1-5. IEEE, 2023.
[8] Xu, Jiacheng. "Automatic Classification and Analysis of
Spam Based on Machine Learning." In 2023 International
Conference on Industrial IoT, Big Data and Supply Chain
(IIoTBDSC), pp. 28-32. IEEE Computer Society, 2023.

[9] Zaizoune, Marouane, et al. "Automatic emails classification."
In 2023 10th International Conference on Wireless Networks
and Mobile Communications (WINCOM), pp. 1-4. IEEE, 2023.
[10] Saraswathi, N., et al. "Email Spam Classification and
Detection using Various Machine Learning Classifiers." In 2024
International Conference on Advances in Computing,
Communication and Applied Informatics (ACCAI), pp. 1-7.
IEEE, 2024.
[11] Pallavi, N., and P. Jayarekha. "Efficient Spam Email
Classification Using Machine Learning Algorithms." In 2023 7th
International Conference on Computation System and
Information Technology for Sustainable Solutions (CSITSS), pp.
1-6. IEEE, 2023.
[12] Ouyang, Qianhe, et al. "E-mail Spam Classification using
KNN and Naive Bayes." Highlights in Science, Engineering and
Technology, vol. 38, 2023, pp. 57-63.
[13] Cheng, Shaopeng. "Classification of Spam E-mail based on
Naïve Bayes Classification Model." Highlights in Science,
Engineering and Technology, vol. 39, 2023, pp. 749-753.
[14] Dangeti, Srinivasa Rao, et al. "Classification Analysis for e-
mail Spam using Machine Learning and Feed Forward Neural
Network Approaches." In International Conference on
Computational Innovations and Emerging Trends (ICCIET-
2024), pp. 66-75. Atlantis Press, 2024.
[16] Bari, Prince, et al. "SMS and E-mail Spam Classification
Using Natural Language Processing and Machine Learning." In
International Conference on Communication, Electronics and
Digital Technology, pp. 103-115. Singapore: Springer Nature
Singapore, 2023.
[17] Iddrisu, Wahab Abdul, et al. "Content-Based Spam
Classification of Academic E-mails: A Machine Learning
Approach." In Advances in Information Communication
Technology and Computing: Proceedings of AICTC 2022, pp.
83-92. Singapore: Springer Nature Singapore, 2023.

[1] Poobalan, A., K. Ganapriya, K. Kalaivani, and K. Parthiban. "A novel and secured email
classification using deep neural network with bidirectional long short-term memory." Computer
Speech & Language 89 (2025): 101667.
[2] De La Noval, Alejandro, Diana Gutierrez, Jayesh Soni, Himanshu Upadhyay, Alexander Perez-
Pons, and Leonel Lagos. "Methodologies for Email Spam Classification Using Large Language
Models." In 2023 International Conference on Computational Science and Computational
Intelligence (CSCI), pp. 179-185. IEEE, 2023.
[3] Alshawi, Bandar, Amr Munsh, Majid Alotaibi, Ryan Alturki, and Nasser Allheeib.
"Classification of SPAM mail utilizing machine learning and deep learning techniques."
International Journal on Information Technologies and Security 16, no. 2 (2024): 71-82.
[4] Dai, Na, Brian D. Davison, and Xiaoguang Qi. "Looking into the past to better classify web
spam." In Proceedings of the 5th international workshop on adversarial information retrieval
on the web, pp. 1-8. 2009.
[5] Krishnamoorthy, Parthiban, Mithileysh Sathiyanarayanan, and Hugo Pedro Proença. "A novel
and secured email classification and emotion detection using hybrid deep neural network."
International Journal of Cognitive Computing in Engineering 5 (2024): 44-57.
[6] Savitha, G., K. Hithyshi, J. Harshitha, Dhanya Shree JN, and Pratiksha Soori. "Advanced Email
Spam Detection: A Machine Learning Solution." In 2023 International Conference on
Evolutionary Algorithms and Soft Computing Techniques (EASCT), pp. 1-5. IEEE, 2023.
[7] Harshini, B. V. "Enhancing Workflow Efficiency through Machine Learning-Based Email
Sorting." Journal of Soft Computing and Computational Intelligence (e-ISSN: 3048-6610) 1, no.
2 (2024): 13-24.
[8] Xu, Jiacheng. "Automatic Classification and Analysis of Spam Based on Machine Learning." In
2023 International Conference on Industrial IoT, Big Data and Supply Chain (IIoTBDSC), pp.
28-32. IEEE Computer Society, 2023.
[9] Zaizoune, Marouane, Youssef Fakhri, and Siham Boulaknadel. "Automatic emails
classification." In 2023 10th International Conference on Wireless Networks and Mobile
Communications (WINCOM), pp. 1-4. IEEE, 2023.
[10] Saraswathi, N., S. Pradeep, V. Sathiyavathi, K. Sabitha, and K. Rajesh Kambattan. "Email Spam
Classification and Detection using Various Machine Learning Classifiers." In 2024
International Conference on Advances in Computing, Communication and Applied Informatics
(ACCAI), pp. 1-7. IEEE, 2024.
[11] Pallavi, N., and P. Jayarekha. "Efficient Spam Email Classification Using Machine Learning
Algorithms." In 2023 7th International Conference on Computation System and Information
Technology for Sustainable Solutions (CSITSS), pp. 1-6. IEEE, 2023.
[12] Ouyang, Qianhe, Jiahe Tian, and Jiale Wei. "E-mail Spam Classification using KNN and
Naive Bayes." Highlights in Science, Engineering and Technology 38 (2023): 57-63.
[13] Cheng, Shaopeng. "Classification of Spam E-mail based on Naïve Bayes Classification
Model." Highlights in Science, Engineering and Technology 39 (2023): 749-753.
[14] Dangeti, Srinivasa Rao, Dileep Kumar Kadali, Yesujyothi Yerramsetti, Ch Raja Rajeswari, D.
Venkata Naga Raju, and Srinath Ravuri. "Classification Analysis for e-mail Spam using
Machine Learning and Feed Forward Neural Network Approaches." In International Conference
on Computational Innovations and Emerging Trends (ICCIET-2024), pp. 66-75. Atlantis Press, 2024.
[16] Bari, Prince, Vimala Mathew, Suchi Prabhu Tandel, Padvariya Aniket, Kishor S. Chaudhari, and
Swapnali Naik. "SMS and E-mail Spam Classification Using Natural Language Processing and
Machine Learning." In International Conference on Communication, Electronics and Digital
Technology, pp. 103-115. Singapore: Springer Nature Singapore, 2023.
[17] Iddrisu, Wahab Abdul, Sylvester Kwasi Adjei-Gyabaa, and Isaac Akoto. "Content-Based
Spam Classification of Academic E-mails: A Machine Learning Approach." In Advances in
Information Communication Technology and Computing: Proceedings of AICTC 2022, pp. 83-92.
Singapore: Springer Nature Singapore, 2023.
[19] Douzi, Samira, Feda A. AlShahwan, Mouad Lemoudden, and Bouabid El Ouahidi. "Hybrid email
spam detection model using artificial intelligence." International Journal of Machine Learning
and Computing 10, no. 2 (2020).
[20] Miao, Zhenghao. "Efficient Spam Classification Using Machine Learning Methods."
Highlights in Science, Engineering and Technology 34 (2023): 60-64.
[21] Miao, Zhenghao. "Efficient Spam Classification Using Machine Learning Methods."
Highlights in Science, Engineering and Technology 34 (2023): 60-64.
[22] Luo, Zili, and Farhana Zulkernine. "An Intelligent Email Classification System." In 2023 IEEE
Symposium Series on Computational Intelligence (SSCI), pp. 1126-1131. IEEE, 2023.
