Sample Report
Submitted by
Samvardhan Singh - RA2111043020005
Pramod Gururajan - RA2111043020006
Anumita R. Ajit - RA2111043020008
Nikhil Shivanath - RA2111043020037
NOV 2024
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
(Deemed to be University U/S 3 of UGC Act, 1956)
BONAFIDE CERTIFICATE
SIGNATURE
Dr. Vani R
Professor and Head
Supervisor
Electronics & Computer Engineering,
SRM Institute of Science and
Technology,
Ramapuram, Chennai.
DECLARATION
We hereby declare that the entire work contained in this minor project report titled “Email
Classification Using RNN and Bidirectional LSTM layers” has been carried out by
Samvardhan Singh [Reg No: RA2111043020005], Pramod Gururajan [Reg No:
RA2111043020006], Anumita R. Ajit [Reg No: RA2111043020008], Nikhil Shivanath
[Reg No: RA2111043020037] at SRM Institute of Science and Technology, Ramapuram
Campus, Chennai- 600089, under the guidance of Dr. Vani R, Professor, Department of
Computer Science and Engineering.
Place: Chennai
Date: 26 Oct 2024 Samvardhan Singh
Pramod Gururajan
Anumita R. Ajit
Nikhil Shivanath
ABSTRACT
We introduce an AI-driven approach for classifying emails into 18 distinct categories
using a Recurrent Neural Network (RNN) architecture with Bidirectional Long Short-Term
Memory (BiLSTM) layers, combined with GloVe pre-trained word embeddings. The model
categorizes emails across a wide range of types: promotional, educational,
confirmational, personal, transactional, newsletters, feedback/surveys, alerts/notifications,
invitations, apologies/corrections, job applications/career-related, collaboration/partnership,
legal/compliance, internal communications, holiday greetings, reminders, unsubscribe, and
welcome emails. By leveraging contextual information from both the forward and backward
directions of the text, the model achieves a classification accuracy of 97%, outperforming
standard approaches such as Naive Bayes and SVM. This makes it suitable for integration into
email management systems, where it can automate email organization, reduce manual effort,
and improve productivity.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
1 INTRODUCTION
1.1 Introduction
1.2 Problem Statement
1.3 Objective of the Project
1.4 Project Domain
1.5 Scope of the Project
1.6 Methodology
2 LITERATURE REVIEW
3 PROJECT DESCRIPTION
3.1 Existing System
3.2 Proposed System
3.3 Feasibility Study
3.3.1 Economic Feasibility
3.3.2 Technical Feasibility
3.3.3 Social Feasibility
3.4 System Specification
3.4.1 Hardware Specification
3.4.2 Software Specification
4 PROPOSED WORK
4.1 Introduction
4.2 General Architecture
4.3 Design Phase
4.3.1 Data Flow Diagram
4.3.2 UML Diagram
4.3.3 Use Case Diagram
4.3.4 Sequence Diagram
4.4 Module Description
4.4.1 Data Collection and Training Data
4.4.2 Step 1: Data Collection
4.4.3 Step 2: Processing of Data
4.4.4 Step 3: Split the Data
4.4.5 Dataset Sample
4.4.6 Step 4: Building the Model
8 SOURCE CODE
8.1 Sample Code
References
LIST OF FIGURES
Chapter 1
INTRODUCTION
1.1 Introduction
This project falls within the domains of Natural Language Processing (NLP)
and Machine Learning, focusing specifically on email classification. It aims
to enhance automated systems for organizing and prioritizing emails in both
personal and professional communication settings.
1.6 Methodology
Chapter 2
LITERATURE REVIEW
Chapter 3
PROJECT DESCRIPTION
• Economic Feasibility
• Technical Feasibility
• Social Feasibility
3.3.1 Economic Feasibility
3.4 System Specification
Chapter 4
PROPOSED WORK
4.1 Introduction
This report, "Email Classification Using RNN and Bidirectional LSTM layers,"
describes an AI-based technique for email classification that employs a
Recurrent Neural Network (RNN) architecture with Bidirectional Long Short-Term
Memory (BiLSTM) layers, supplemented with GloVe pre-trained word
embeddings. The proposed model classifies emails into 18 different
categories, ranging from promotional and transactional emails to legal notices
and internal communications, with an accuracy of 97%. By analyzing text
sequences in both forward and backward directions, the BiLSTM model
captures the contextual and semantic subtleties of the text,
which is required for correctly classifying the various email types.
The model architecture comprises an embedding layer, BiLSTM layers, and
dense layers. To prevent overfitting, it applies regularization techniques
such as dropout and batch normalization. It also uses an adaptive optimizer
(Adam) and categorical cross-entropy loss for multi-class classification. This
technique represents a substantial advance in automated email
categorization, reducing the manual effort required for email sorting and
increasing productivity by correctly handling complex email types in
both personal and commercial settings. Figure 4.1 represents the distribution
of email categories within the dataset.
4.2 General Architecture
Figure 4.2 represents the detailed architecture of the RNN model with
BiLSTM layers for email classification. The flowchart defines the following:
2. Preprocessing includes:
a. Padding: Standardizing input length
b. Feature Extraction
c. Tokenization: Converting preprocessed text into a sequence of integers
3. Embedding Layer:
a. Uses GloVe pre-trained embeddings (100-dimensional vectors)
b. Converts tokenized text into word embeddings
4. BiLSTM Layers:
a. BiLSTM Layer 1: 128 units
b. BiLSTM Layer 2: 64 units
c. Processes text in both forward and backward directions
5. Model Components:
a. Dropout layer: For preventing overfitting (30% dropout rate)
b. Batch Normalization: For stabilizing training
c. Regularization: Additional measure against overfitting
6. Dense Layers:
a. Dense Layer: 64 units with ReLU activation
b. Output Layer: 18 units (corresponding to email categories) with
softmax activation
7. Final Steps:
a. Load Trained Model
b. Predict Email Category
c. Display Results
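As a rough cross-check of the layer sizes listed above, the sketch below traces output shapes and parameter counts through the stack in plain Python. The vocabulary size (20,000), the sequence length (100), the concatenation of forward and backward BiLSTM outputs, and the placement of batch normalization after the second BiLSTM layer are assumptions made for illustration; the report does not state them.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM direction: 4 gates, each with input weights,
    recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def describe_model(vocab_size=20_000, seq_len=100, embed_dim=100, n_classes=18):
    """Return (layer name, output shape, parameter count) for each layer.

    vocab_size and seq_len are assumed values, not figures from the report.
    """
    layers = []
    # Embedding: one 100-d GloVe vector per vocabulary entry.
    layers.append(("embedding", (seq_len, embed_dim), vocab_size * embed_dim))
    # BiLSTM 1: forward + backward LSTM, outputs concatenated per time step.
    layers.append(("bilstm_1", (seq_len, 2 * 128), 2 * lstm_params(embed_dim, 128)))
    # BiLSTM 2: consumes the 256-d sequence and returns only the final states.
    layers.append(("bilstm_2", (2 * 64,), 2 * lstm_params(2 * 128, 64)))
    # Dropout (rate 0.3) has no parameters.
    layers.append(("dropout", (128,), 0))
    # Batch normalization: scale, shift, moving mean, moving variance.
    layers.append(("batch_norm", (128,), 4 * 128))
    layers.append(("dense", (64,), 128 * 64 + 64))
    layers.append(("output", (n_classes,), 64 * n_classes + n_classes))
    return layers


for name, shape, params in describe_model():
    print(f"{name:<11} {str(shape):<12} {params:>9,}")
```

Under these assumptions the embedding table dominates the parameter count, which is one reason pre-trained GloVe vectors are attractive: those weights do not have to be learned from the email dataset alone.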
4.3 Design Phase
4.3.1 Data Flow Diagram
Figure 4.3 shows the data flow of the email classification system. It
illustrates:
1. The process starts with "User Input" at the top
2. Branches into two parallel paths:
a. Left path: Goes through Raw E-mail text → Preprocessing →
Make Plan Adjustments → Monitor Project Progress
b. Right path: Goes through Feature Extraction → TF-IDF Calculation →
Make Plan Adjustments → Monitor Project Progress
3. Both paths converge into the "Classification model"
4. Finally leads to the "Predicted E-mail category" (spelled "Predicated" in
the figure)
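The TF-IDF calculation named in the right-hand path can be sketched in a few lines. The version below uses one common variant (term frequency normalized by document length, and idf = log(N / document frequency)); the report does not say which variant the diagram refers to, so the formula, the function name, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using
    tf = count / document length and idf = log(N / document frequency)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["meeting reminder tomorrow", "sale ends tomorrow", "sale sale sale"]
for weights in tf_idf(docs):
    print(weights)
```

Terms that occur in few documents (such as "meeting" here) receive a higher weight than terms shared across documents (such as "tomorrow").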
4.3.2 UML Diagram
Figure 4.4 shows the class diagram of the email classification system
architecture with two main components:
4.3.3 Use Case Diagram
Figure 4.5 illustrates the overall flow of the email classification system,
where the user inputs an email that undergoes preprocessing to clean and
prepare the text. The system then classifies the email using the RNN/BiLSTM
model, which can also be updated through training with new data to improve
performance. The user can view the classified category results and interact
with the model as needed.
4.3.4 Sequence Diagram
Figure 4.6 represents a step-by-step sequence, starting with the user's email
input. The email is cleaned, tokenized, padded for uniform length, and then
transformed using word embeddings. It is processed through the BiLSTM
model, and the output is passed through a dense layer and a softmax layer for
classification. The predicted email category is finally returned to the user.
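The tokenization and padding steps in this sequence can be illustrated without a deep-learning library. The sketch below is a simplified stand-in for the Keras Tokenizer and pad_sequences utilities imported in the project code; the function names, the maximum length of 6, and the choice to pad and truncate at the end of each sequence are assumptions.

```python
from collections import Counter


def build_vocab(texts):
    """Map each word to an integer index, most frequent first.
    Index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, vocab, max_len):
    """Convert texts to integer sequences of exactly max_len entries.
    Unknown words are skipped; short sequences are right-padded with 0."""
    padded = []
    for text in texts:
        seq = [vocab[w] for w in text.split() if w in vocab][:max_len]
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded


emails = ["your order has shipped", "please confirm your order today"]
vocab = build_vocab(emails)
sequences = texts_to_padded(emails, vocab, max_len=6)
print(sequences)
```

Every email ends up as a fixed-length list of integers, which is the form the embedding layer expects as input.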
4.4 Module Description
• Identify Data Sources: Collect emails from reliable sources, like publicly
available datasets (e.g., Enron email dataset, spam classification datasets)
or organizational email systems with permission.
• Category Selection: Define clear email categories (e.g., promotional,
transactional, personal) to guide data labeling.
• Data Labeling: Use manual or automated labeling methods to assign
emails to categories, ensuring high-quality labeled data.
• Data Diversity: Ensure the dataset represents various email types, topics,
and formats to improve the model’s robustness and generalization.
After the pre-processing part, the dataset was split into 80% for training and 20%
for testing. This split helps to train the model on the majority of the data while
reserving a portion for evaluating its performance on unseen samples, ensuring
better generalization. Figure 4.8 shows a sample of the split dataset.
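A minimal sketch of an 80/20 split in plain Python follows. The project code imports scikit-learn's train_test_split for this step; the helper below is only a simplified stand-in, and the function name, the seed of 42, and the toy data are assumptions.

```python
import random


def split_80_20(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle indices reproducibly, then hold out test_fraction for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


texts = [f"email {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
train, test = split_80_20(texts, labels)
print(len(train), len(test))  # 8 2
```

With 18 categories of unequal size, a stratified split (one that preserves class proportions in both parts) may be preferable; this sketch does not stratify.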
• Data Preprocessing: Clean the email text data (remove special characters,
convert to lowercase, etc.), then tokenize and vectorize it using pre-trained
embeddings like GloVe.
• Model Architecture: Choose a recurrent neural network (RNN)
architecture with BiLSTM layers to handle the text's sequential nature. Add
an embedding layer, followed by stacked BiLSTM layers with dropout and
batch normalization to prevent overfitting.
• Output Layer: Set a dense layer with a softmax activation to classify into
multiple categories.
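The vectorization step can be sketched as building an embedding matrix from GloVe's plain-text format, in which each line holds a word followed by its vector components. The sketch below reads a tiny in-memory sample with 3-dimensional vectors in place of a real 100-dimensional GloVe file; the function name, the sample values, and the zero-vector fallback for words missing from GloVe are assumptions.

```python
import io


def load_embedding_matrix(glove_file, word_index, dim):
    """Build a (len(word_index) + 1) x dim matrix; row i holds the vector
    for the word with index i. Row 0 (padding) and words missing from the
    GloVe file stay as zero vectors."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for line in glove_file:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in word_index and len(values) == dim:
            matrix[word_index[word]] = [float(v) for v in values]
    return matrix


# Toy stand-in for a GloVe text file (a real file would have 100 values per line).
sample = io.StringIO("order 0.1 0.2 0.3\ninvoice 0.4 0.5 0.6\n")
word_index = {"order": 1, "refund": 2}
matrix = load_embedding_matrix(sample, word_index, dim=3)
print(matrix)
```

A matrix built this way is typically passed to the embedding layer as its initial weights, so that each token index maps directly to its pre-trained vector.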
Chapter 5
5.1 Testing
Testing is the process of evaluating a system or its component(s) with the intent
to find whether it satisfies the specified requirements or not.
1. Performance Metrics: To comprehensively evaluate the model’s classification
accuracy, a combination of performance metrics is used, such as:
• Accuracy: Measures the proportion of total correct predictions, providing an
overall sense of model accuracy.
• Precision: Evaluates the ratio of true positive predictions to the total predicted
positives, ensuring the model’s relevance when identifying specific classes.
• Recall: Determines the proportion of true positives identified from the actual
positives, assessing the model's effectiveness in capturing all relevant
instances.
• F1-Score: Balances precision and recall, providing a harmonic mean that
highlights the model’s ability to maintain accuracy across diverse categories. A
high F1-score indicates robust, well-rounded classification performance,
particularly in scenarios with imbalanced datasets.
2. Confusion Matrix Analysis: Check the confusion matrix to understand
category-wise classification accuracy.
3. Fine-Tuning: Adjust and retrain the model if certain categories show low
performance to improve accuracy and generalization.
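The metrics above can all be derived from a confusion matrix. The sketch below computes accuracy and per-class precision, recall, and F1-score for a toy 3-class matrix; the counts are made up for illustration and are not results from this project, and the function names are assumptions.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples with true class i predicted as j.
    Returns a list of (precision, recall, f1) tuples, one per class."""
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total


# Toy 3-class matrix (made-up counts, not results from the report).
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
print(round(accuracy(cm), 3))
for p, r, f in per_class_metrics(cm):
    print(round(p, 3), round(r, 3), round(f, 3))
```

Reading the matrix row by row shows which true categories are being confused with which predictions, which is the basis for the fine-tuning step in item 3.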
Figure 5.1: Confusion Matrix
For training, the dataset is divided with 80% used for training and 20% for
testing. The Adam optimizer, a commonly used variant of stochastic gradient
descent, is employed as it adapts the learning rate based on the first and second
moments of the gradients. The loss function used is categorical cross-entropy,
ideal for multi-class classification tasks like this, where more than two
categories exist. Early stopping is applied during training to halt the process
if the validation loss does not improve for three consecutive epochs,
preventing overfitting and ensuring the model does not learn noise from the
data. The model is trained for up to 15 epochs with a batch size of 32, where
each epoch represents a full pass through the training data, and the batch size
refers to the number of samples processed before the model updates its
weights. Figure 5.1 shows the confusion matrix, which conveys the class-wise
accuracy of the model.
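To make the loss and optimizer concrete, the sketch below computes softmax probabilities, the categorical cross-entropy for one sample, and a single Adam update for one scalar weight. The hyperparameters are the defaults from the original Adam paper; the learning rate actually used in this project is not stated, so the values and the function names here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_class):
    """Loss for one sample: negative log-probability of the true class."""
    return -math.log(probs[true_class])


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.
    m and v are running estimates of the gradient's first and second moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


probs = softmax([2.0, 1.0, 0.1])
print(categorical_cross_entropy(probs, true_class=0))

w, m, v = 0.5, 0.0, 0.0
w, m, v = adam_step(w, grad=0.2, m=m, v=v, t=1)
print(w)
```

Because the update divides by the square root of the second-moment estimate, the step size is roughly the learning rate regardless of the gradient's scale, which is the sense in which Adam adapts the learning rate per parameter.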
Figure 5.2: Model Training
Figures 5.2 and 5.3 display the training and validation curves for the model's
accuracy and loss. Early stopping halts training after three consecutive epochs
without improvement in validation loss. The F1-score gives a balanced measure of
performance by taking the harmonic mean of precision and recall. Recall indicates
how many of the actual positive examples were correctly predicted by the model.
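The early-stopping rule can be sketched as a small loop over per-epoch validation losses. The patience of 3 and the cap of 15 epochs follow the text; the function name and the loss values are made up for illustration, and in real training each loss would come from evaluating the model after an epoch.

```python
def train_with_early_stopping(val_losses, patience=3, max_epochs=15):
    """Simulate early stopping over a precomputed list of validation losses.
    Returns (epochs_run, best_epoch), both 1-based."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_epoch


# Made-up validation losses: improvement stalls after epoch 4.
losses = [0.90, 0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
print(train_with_early_stopping(losses))  # (7, 4)
```

Training stops at epoch 7, three epochs after the best validation loss at epoch 4, so the later improvement at epoch 8 is never reached. This is the usual trade-off of a short patience window.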
Table 5.1. Precision, Recall, F1 score and support of each category
Chapter 7
CONCLUSION AND FUTURE
ENHANCEMENTS
7.1 Conclusion
The proposed model, which uses a Recurrent Neural Network (RNN)
with Bidirectional Long Short-Term Memory (BiLSTM) layers and GloVe
embeddings, classified emails into 18 separate categories with high accuracy
while handling both simple and complex email types. Using the BiLSTM
architecture, the model gathered contextual information from both the
preceding and following words in the text, improving its ability to
distinguish meaning and context, particularly in nuanced categories
such as legal/compliance and collaboration/partnership. With a
classification accuracy of 97%, the model outperformed standard approaches
like Naive Bayes and SVM, indicating its potential for real-world use in
automated email sorting and management. The model's robustness makes it
well suited to contexts where accurate email classification is critical for
workflow management.
Although the present model performs well, several future enhancements
could increase its versatility and accuracy:
1) Expansion of Email Categories: Adding more detailed subcategories
within the current classes, such as dividing promotional emails into
seasonal promotions and tailored offers, would give users
more specialized email classification.
2) Handling Mixed-Content Emails: Improving the model's ability to
recognize and categorize emails with several content types (for example, a
promotional email with legal disclaimers) would help it handle complex
email structures better.
3) Sentiment Analysis: By incorporating sentiment analysis layers, the model
could provide emotional context, such as urgency or intent, which would be
useful in prioritizing emails.
4) Improving Generalization with Transfer Learning: Exploring pre-trained
large language models or applying transfer learning techniques may make the
model more adaptable to new or unseen email categories with minimal
retraining.
5) Real-Time Processing Optimization: Additional optimizations, such as
reducing computational complexity or using cloud-based processing,
would enable smooth, real-time email categorization, which is especially
useful in business settings with large email volumes.
Chapter 8
SOURCE CODE
8.1 Sample Code
import re

import nltk
import numpy as np
import pandas as pd
import tensorflow as tf
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

print(tf.__version__)

# Download the NLTK stopword list used during preprocessing
nltk.download('stopwords')
# Renaming columns
df.columns = ['email_type', 'content']

# Build the stopword set once instead of reloading the list for every word
STOP_WORDS = set(stopwords.words('english'))

# Data preprocessing
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'<.*?>', '', text)        # Remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove digits and special characters
    # Remove stopwords
    text = ' '.join(word for word in text.split() if word not in STOP_WORDS)
    return text

df['content'] = df['content'].apply(preprocess_text)
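# Update labels
y = df['email_type'].values - 1  # Convert labels to 0-based indices

# Build the model: embedding -> BiLSTM -> dropout -> softmax over 18 classes
# (vocab_size and embedding_dim must be defined before this point;
# they are not shown in this excerpt)
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
model.add(Bidirectional(LSTM(128)))
model.add(Dropout(0.5))
model.add(Dense(18, activation='softmax'))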
References
[1] Poobalan, A., K. Ganapriya, K. Kalaivani, and K. Parthiban. "A novel and secured email
classification using deep neural network with bidirectional long short-term memory." Computer
Speech & Language 89 (2025): 101667.
[2] De La Noval, Alejandro, Diana Gutierrez, Jayesh Soni, Himanshu Upadhyay, Alexander Perez-
Pons, and Leonel Lagos. "Methodologies for Email Spam Classification Using Large Language
Models." In 2023 International Conference on Computational Science and Computational
Intelligence (CSCI), pp. 179-185. IEEE, 2023.
[3] Alshawi, Bandar, Amr Munsh, Majid Alotaibi, Ryan Alturki, and Nasser Allheeib.
"Classification of SPAM mail utilizing machine learning and deep learning techniques."
International Journal on Information Technologies and Security 16, no. 2 (2024): 71-82.
[4] Dai, Na, Brian D. Davison, and Xiaoguang Qi. "Looking into the past to better classify web
spam." In Proceedings of the 5th international workshop on adversarial information retrieval
on the web, pp. 1-8. 2009.
[5] Krishnamoorthy, Parthiban, Mithileysh Sathiyanarayanan, and Hugo Pedro Proença. "A novel
and secured email classification and emotion detection using hybrid deep neural network."
International Journal of Cognitive Computing in Engineering 5 (2024): 44-57.
[6] Savitha, G., K. Hithyshi, J. Harshitha, Dhanya Shree JN, and Pratiksha Soori. "Advanced Email
Spam Detection: A Machine Learning Solution." In 2023 International Conference on
Evolutionary Algorithms and Soft Computing Techniques (EASCT), pp. 1-5. IEEE, 2023.
[7] Harshini, B. V. "Enhancing Workflow Efficiency through Machine Learning-Based Email
Sorting." Journal of Soft Computing and Computational Intelligence (e-ISSN: 3048-6610) 1, no.
2 (2024): 13-24.
[8] Xu, Jiacheng. "Automatic Classification and Analysis of Spam Based on Machine Learning." In
2023 International Conference on Industrial IoT, Big Data and Supply Chain (IIoTBDSC), pp.
28-32. IEEE Computer Society, 2023.
[9] Zaizoune, Marouane, Youssef Fakhri, and Siham Boulaknadel. "Automatic emails
classification." In 2023 10th International Conference on Wireless Networks and Mobile
Communications (WINCOM), pp. 1-4. IEEE, 2023.
[10] Saraswathi, N., S. Pradeep, V. Sathiyavathi, K. Sabitha, and K. Rajesh Kambattan. "Email Spam
Classification and Detection using Various Machine Learning Classifiers." In 2024
International Conference on Advances in Computing, Communication and Applied Informatics
(ACCAI), pp. 1-7. IEEE, 2024.
[11] Pallavi, N., and P. Jayarekha. "Efficient Spam Email Classification Using Machine Learning
Algorithms." In 2023 7th International Conference on Computation System and Information
Technology for Sustainable Solutions (CSITSS), pp. 1-6. IEEE, 2023.
[12] Ouyang, Qianhe, Jiahe Tian, and Jiale Wei. "E-mail Spam Classification using KNN and
Naive Bayes." Highlights in Science, Engineering and Technology 38 (2023): 57-63.
[13] Cheng, Shaopeng. "Classification of Spam E-mail based on Naïve Bayes Classification
Model." Highlights in Science, Engineering and Technology 39 (2023): 749-753.
[14] Dangeti, Srinivasa Rao, Dileep Kumar Kadali, Yesujyothi Yerramsetti, Ch Raja Rajeswari, D.
Venkata Naga Raju, and Srinath Ravuri. "Classification Analysis for e-mail Spam using
Machine Learning and Feed Forward Neural Network Approaches." In International
Conference on Computational Innovations and Emerging Trends (ICCIET-2024), pp. 66-75.
Atlantis Press, 2024.
[15] Bari, Prince, Vimala Mathew, Suchi Prabhu Tandel, Padvariya Aniket, Kishor S. Chaudhari, and
Swapnali Naik. "SMS and E-mail Spam Classification Using Natural Language Processing and
Machine Learning." In International Conference on Communication, Electronics and Digital
Technology, pp. 103-115. Singapore: Springer Nature Singapore, 2023.
[16] Iddrisu, Wahab Abdul, Sylvester Kwasi Adjei-Gyabaa, and Isaac Akoto. "Content-Based
Spam Classification of Academic E-mails: A Machine Learning Approach." In Advances in
Information Communication Technology and Computing: Proceedings of AICTC 2022, pp. 83-92.
Singapore: Springer Nature Singapore, 2023.
[17] Douzi, Samira, Feda A. AlShahwan, Mouad Lemoudden, and Bouabid El Ouahidi. "Hybrid email
spam detection model using artificial intelligence." International Journal of Machine Learning
and Computing 10, no. 2 (2020).
[18] Miao, Zhenghao. "Efficient Spam Classification Using Machine Learning Methods."
Highlights in Science, Engineering and Technology 34 (2023): 60-64.
[19] Luo, Zili, and Farhana Zulkernine. "An Intelligent Email Classification System." In 2023 IEEE
Symposium Series on Computational Intelligence (SSCI), pp. 1126-1131. IEEE, 2023.