EMAIL SPAM FINAL (2)
MACHINE LEARNING
A PROJECT REPORT
Submitted by
Pavethra M (621522205037)
Pavithra S (621522205038)
Dhushara S (621522205015)
Dhanya R (621520205013)
of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
MAY 2025
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Certified that this Naan Mudhalvan & TNDSC Skill Development course project report
“EMAIL/SMS SPAM DETECTION USING MACHINE LEARNING” is the
bonafide work done by
Pavethra M (621522205037)
Pavithra S (621522205038)
Dhushara S (621522205015)
Dhanya R (621522205013)
SIGNATURE SIGNATURE
Dr. T. AKILA, M.E, Ph.D., Dr. T. AKILA, M.E, Ph.D.,
ASSOCIATE PROFESSOR, ASSOCIATE PROFESSOR,
HEAD OF THE DEPARTMENT, SUPERVISOR,
Department of Information Technology, Department of Information Technology,
Mahendra College of Engineering, Mahendra College of Engineering,
Minnampalli, Salem-636106. Minnampalli, Salem-636106.
ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance
from many people, and we are extremely fortunate to have received this throughout the
completion of our project work.
We owe our profound gratitude to our guide, Dr. T. AKILA, Head of the Department
of Information Technology, who took an interest in our project work and provided all the
necessary information for developing the project successfully. We also thank all the staff
members of our college and the technicians for their help in making this project a successful one.
Lastly, we would like to thank the Almighty and our parents for their moral support, and
our friends, with whom we shared our day-to-day experience and from whom we received many
suggestions that improved the quality of our work.
ABSTRACT
Spam refers to unsolicited email, such as advertisements and unrelated or repetitive
messages. The volume of such email is increasing day by day; studies show that around
55 percent of all emails are some kind of spam. Service providers put a lot of effort
into filtering it, but spam keeps evolving by changing the obvious markers used for
detection. Moreover, service providers cannot classify too aggressively, because a
misclassification may cause the loss of legitimate information.
To tackle this problem, we present a new and efficient method to detect spam
using machine learning and natural language processing: a tool that can detect and
classify spam. In addition, it provides information about the supplied text in a
quick-view format for user convenience.
TABLE OF CONTENTS
ABSTRACT
1 INTRODUCTION
1.1 OVERVIEW
2 LITERATURE REVIEW
2.6 EMAIL SPAM FILTERING SYSTEMS
3 SYSTEM ANALYSIS
3.2 PROPOSED SYSTEM
4 SYSTEM REQUIREMENTS
6 UML DIAGRAMS
7 PERFORMANCE ANALYSIS
7.3 METHODOLOGY
8 CONCLUSION
9 FUTURE ENHANCEMENT
10 APPENDIX
11 BIBLIOGRAPHY
CHAPTER-1
INTRODUCTION
1.1 OVERVIEW
The scope of the Email/SMS Spam Detection Using Machine Learning project is to
develop a robust and intelligent system capable of automatically identifying and filtering
spam messages in both email and SMS communications. This system is designed to aid
users—including individuals, enterprises, and digital communication platforms—by
reducing exposure to unsolicited, malicious, or fraudulent content. The application leverages
a combination of natural language processing (NLP), machine learning algorithms, and real-
time data processing to classify messages with high accuracy.
While the project focuses on the intelligent detection and classification of spam
messages, it does not encompass direct integration with existing email/SMS providers or
mobile network operators, nor does it include features for data encryption, cybersecurity
measures, or phishing site blocking. The primary goal is to demonstrate the effectiveness of
machine learning in spam detection and provide a foundation for future development and
integration into larger communication security frameworks.
1.3 OBJECTIVE OF THE PROJECT
In this study, the Email/SMS Spam Detection Using Machine Learning project aims
to empower users by providing them with a reliable, intelligent system that can
automatically detect and filter spam messages from genuine communication. By integrating
key components such as natural language processing, text classification, feature extraction
techniques, and machine learning algorithms into a single, user-friendly application, the
project seeks to simplify the process of identifying unwanted or malicious content and
enhance the overall communication experience.
This system is designed to improve digital communication security and efficiency by
offering accurate spam classification, timely alerts, and real-time message analysis. Through
predictive models and continual learning, it helps users avoid phishing attempts, reduce
distractions from unsolicited messages, and ensure that only relevant and safe content
reaches their inbox. Ultimately, the platform bridges the gap between traditional rule-based
spam filters and modern AI-powered detection, enabling smarter, faster, and more adaptable
spam management for both personal and organizational use.
CHAPTER-2
LITERATURE REVIEW
2.4 TITLE: FEATURE EXTRACTION METHODS (TF-IDF, BAG OF WORDS)
Authors: Dr. Sneha Joshi
This review examines the most widely used feature extraction methods in spam
detection systems, namely TF-IDF (Term Frequency–Inverse Document Frequency) and
Bag of Words. It evaluates how these techniques convert raw text into numerical
representations that can be processed by machine learning algorithms and compares their
strengths, limitations, and use cases.
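The difference between the two representations can be illustrated with a minimal sketch. It is not the implementation used in this project: the three sample messages are invented, and the weighting uses the plain textbook definitions (term frequency as count divided by message length, inverse document frequency as ln(N / df)), whereas library implementations usually apply additional smoothing and normalization.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of tokenized messages."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) with tf = count / len(doc), idf = ln(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of messages that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1  # guard against empty messages
        vectors.append([(c / length) * w for c, w in zip(row, idf)])
    return vocab, vectors


# Invented sample messages, already tokenized.
docs = [
    "win a free prize now".split(),
    "free entry win cash".split(),
    "are we meeting now".split(),
]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
```

In the first message, Bag of Words gives "free" and "prize" the same count of 1, while TF-IDF weights "prize" higher because it occurs in only one of the three messages.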
CHAPTER-3
SYSTEM ANALYSIS
Traditional spam detection systems used in email and SMS platforms rely largely on
rule-based filters and keyword matching. These systems scan incoming messages for specific
terms or phrases commonly associated with spam and flag or filter matching messages
accordingly. While this method has served as a foundational defense against unwanted
communications, it is increasingly inadequate in the face of evolving spam techniques and
the dynamic nature of digital communication.
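A keyword filter of this kind can be sketched in a few lines, which also makes its weaknesses visible. The keyword list, the threshold, and the sample messages below are hypothetical and chosen only for illustration.

```python
# Hypothetical keyword list; real rule-based filters maintain much larger lists.
SPAM_KEYWORDS = {"free", "winner", "prize", "urgent"}


def keyword_filter(message, keywords=SPAM_KEYWORDS, threshold=1):
    """Flag a message as spam when it contains at least `threshold` listed keywords."""
    words = {w.strip(".,!?:;").lower() for w in message.split()}
    return len(words & keywords) >= threshold


obvious_spam = keyword_filter("You are a WINNER! Claim your free prize")
false_positive = keyword_filter("Feel free to call me tonight")
missed_spam = keyword_filter("You are a w1nner, claim your fr3e pr1ze")
```

The first message is flagged as intended. The second is a legitimate message flagged only because it contains the word "free", and the third evades the filter through simple character substitution. Both failure modes motivate a learned classifier.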
3.2 PROPOSED SYSTEM
The proposed system introduces an intelligent, machine learning-based approach to
automatically detect and filter spam messages in both email and SMS communications.
This system is designed to overcome the limitations of traditional rule-based filters by
leveraging data-driven algorithms that can learn, adapt, and improve over time. It aims to
offer a more accurate, context-aware, and scalable solution for identifying unwanted or
malicious messages.
3.2.1.1 ADVANTAGES OF PROPOSED SYSTEM
1. Improved Accuracy Through Machine Learning
The system is designed to handle different types of message formats, including the
short, informal language common in SMS. This makes it versatile and applicable across
multiple communication platforms and use cases.
CHAPTER-4
SYSTEM REQUIREMENTS
5.1 ML ARCHITECTURE OVERVIEW
5.3 DATA COLLECTION
Data was collected from publicly available sources such as the UCI SMS Spam
Collection, Enron Email Dataset, and Kaggle repositories. These datasets include labeled
spam and non-spam messages from diverse formats. All data was anonymized to ensure
privacy compliance and balanced using synthetic sampling techniques where necessary.
Collected messages were cleaned and prepared using text preprocessing techniques
such as lowercasing, punctuation removal, tokenization, stop-word removal, stemming,
and lemmatization. The processed text was converted into numerical vectors using TF-
IDF or Bag of Words, making it suitable for machine learning model input.
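The first four cleaning steps can be sketched with the standard library alone. The stop-word list below is a tiny illustrative subset, not the list used in the project, and stemming and lemmatization are left out because they need a dedicated library such as NLTK.

```python
import string

# Tiny illustrative stop-word list (an assumption); a real pipeline uses a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "for", "of", "and"}


def preprocess(message):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stop words."""
    lowered = message.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("Congratulations! You are the WINNER of a FREE prize.")
```

For the sample message, the remaining tokens are "congratulations", "winner", "free", and "prize", which are then passed to the vectorization step.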
A portion of the dataset was reserved for testing. It underwent the same
preprocessing as the training data. Model performance was evaluated using metrics like
accuracy, precision, recall, and F1-score to ensure effective generalization to new
messages.
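As a reference for how these metrics relate to each other, the sketch below computes them from true and predicted labels, treating spam as the positive class (label 1). The ten labels are invented for illustration and are not results from this project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels: 3 spam and 7 ham messages, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this example the accuracy is 0.8, while precision, recall, and F1 are each 2/3. This shows why accuracy alone can overstate performance on the minority (spam) class.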
Several algorithms were tested, including Naive Bayes, SVM, Random Forest, and
Logistic Regression. Deep learning models like CNN and LSTM were also considered.
The best-performing model, based on evaluation metrics, was selected for deployment.
5.6.1 ALGORITHM USED
Support Vector Machine is a powerful supervised learning algorithm used for binary
classification. It identifies the optimal hyperplane that best separates spam from non-spam
messages in the feature space. SVM is particularly useful when dealing with high-
dimensional data like TF-IDF vectors and is known for its high accuracy and robustness in
spam filtering.
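The role of the hyperplane can be made concrete with a small sketch of a linear SVM's decision rule and hinge loss. The weights, bias, and two-dimensional feature vectors are hypothetical values, not parameters learned in this project; labels are encoded as +1 for spam and -1 for ham.

```python
def decision_value(weights, bias, features):
    """Signed score w . x + b of a feature vector relative to the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def classify(weights, bias, features):
    """Return +1 (spam) on the non-negative side of the hyperplane, else -1 (ham)."""
    return 1 if decision_value(weights, bias, features) >= 0 else -1


def hinge_loss(weights, bias, features, label):
    """Hinge loss max(0, 1 - y * (w . x + b)); zero once the margin is at least 1."""
    return max(0.0, 1.0 - label * decision_value(weights, bias, features))


# Hypothetical hyperplane and feature vectors, for illustration only.
weights, bias = [2.0, -1.0], -0.5
spam_like = [1.0, 0.0]
ham_like = [0.0, 1.0]
borderline = [0.5, 0.0]

spam_score = decision_value(weights, bias, spam_like)
borderline_loss = hinge_loss(weights, bias, borderline, 1)
```

Training searches for the weights and bias that keep the total hinge loss small while maximizing the margin. Here the borderline message is classified as spam but still incurs a loss of 0.5 because it lies inside the margin.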
Neural Networks, including architectures like CNN and LSTM, are used to model
complex and contextual patterns in text data. These deep learning models automatically
learn representations from raw input without manual feature engineering. While they
require more computational resources and training time, they can offer superior accuracy,
especially with large and diverse datasets.
Each algorithm was trained and tested on the same pre-processed dataset. Their
performance was evaluated using metrics such as accuracy, precision, recall, F1-score, and
ROC-AUC.
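Unlike the threshold-based metrics, ROC-AUC is computed from the model's scores rather than its hard predictions. One way to read it is as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it directly from that definition; the labels and scores are invented.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (spam, ham) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: two spam messages (label 1) and three ham messages (label 0).
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(y_true, scores)
```

Five of the six spam/ham pairs are ranked correctly here, so the result is 5/6. The pairwise form is quadratic in the number of messages and is meant only to show the definition.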
SVM provided high precision and was effective in minimizing false positives.
Based on this comparison, the model best suited for deployment was selected considering
both performance and implementation feasibility.
CHAPTER-6
UML DIAGRAMS
6.2 USE CASE DIAGRAM
6.3 ACTIVITY DIAGRAM
CHAPTER-7
PERFORMANCE ANALYSIS
The dataset used for training and evaluating the spam detection model consists of:
Features:
Labels (Spam/Ham)
Preprocessing Steps:
Dataset Statistics:
Class Count
Spam 1,000
Ham 4,000
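The class counts above imply an imbalanced dataset, which affects how accuracy figures should be read. The short calculation below uses only the two counts from the table.

```python
class_counts = {"spam": 1000, "ham": 4000}

total = sum(class_counts.values())
spam_share = class_counts["spam"] / total

# Accuracy of a trivial classifier that labels every message as the majority class.
majority_baseline = max(class_counts.values()) / total
```

Spam makes up 20% of the 5,000 messages, so a classifier that labels every message as ham would already reach 80% accuracy. Reported accuracies are therefore best compared against this baseline and read together with precision and recall.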
7.2 ACCURACY COMPARISON OF ALGORITHMS
Different Machine Learning algorithms were tested for spam detection, and
their performance was compared using metrics like Accuracy, Precision, Recall, and F1-
Score.
Comparison Table:
Observations:
Naive Bayes is fast but less accurate for complex spam patterns.
Logistic Regression provides a good balance between speed and accuracy.
Random Forest & SVM perform well but may overfit on small datasets.
LSTM (Deep Learning) achieves the highest accuracy but requires more
computational power.
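The speed of Naive Bayes comes from the fact that training amounts to counting words per class. A minimal multinomial Naive Bayes with Laplace smoothing can be sketched as follows; the four training messages are invented, and this is not the model trained in this project.

```python
import math
from collections import Counter


def train_naive_bayes(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with Laplace smoothing on tokenized messages."""
    vocab = {tok for doc in docs for tok in doc}
    priors, likelihoods = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d)
        total = sum(counts.values()) + alpha * len(vocab)
        likelihoods[c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return priors, likelihoods


def predict(model, doc):
    """Return the class with the highest log-posterior; unseen tokens are skipped."""
    priors, likelihoods = model
    scores = {
        c: priors[c] + sum(likelihoods[c][t] for t in doc if t in likelihoods[c])
        for c in priors
    }
    return max(scores, key=scores.get)


# Invented training messages, already tokenized.
docs = [
    "win free prize now".split(),
    "free cash offer".split(),
    "lunch at noon".split(),
    "see you at the meeting".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
prediction = predict(model, "free prize offer".split())
```

Because each word contributes independently to the score, the model cannot capture word order or context. This is consistent with the observation that it is less accurate on complex spam patterns.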
7.3 METHODOLOGY
The spam detection system was developed using the following steps:
1. Data Collection & Preprocessing
Gathered labeled datasets (spam/ham).
Cleaned text data (removed noise, normalized text).
Applied TF-IDF/Word Embeddings for feature extraction.
2. Model Training
Split data into 70% training, 30% testing.
Trained multiple ML models (Naive Bayes, SVM, Random Forest, etc.).
Implemented LSTM for deep learning-based classification.
3. Evaluation Metrics
Used Confusion Matrix, Precision, Recall, F1-Score.
Compared model performance to select the best one.
4. Deployment (Optional)
Integrated the best model into a Flask/Django web app or Android app.
Enabled real-time spam detection for emails/SMS.
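The 70/30 split in step 2 can be sketched as below. Splitting each class separately (a stratified split) is an assumption made here so that both parts keep the same spam-to-ham ratio; the report does not state whether the split was stratified. The integer items stand in for messages, and the class sizes follow the dataset statistics.

```python
import random


def stratified_split(items, labels, train_fraction=0.7, seed=0):
    """Split (item, label) pairs per class so both parts keep the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    train, test = [], []
    for label, group in by_class.items():
        group = group[:]  # copy so the caller's data is not reordered
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train += [(x, label) for x in group[:cut]]
        test += [(x, label) for x in group[cut:]]
    return train, test


# Placeholder items standing in for 1,000 spam and 4,000 ham messages.
items = list(range(5000))
labels = ["spam"] * 1000 + ["ham"] * 4000
train, test = stratified_split(items, labels)
train_spam = sum(1 for _, label in train if label == "spam")
```

With these counts, the training part holds 3,500 messages (700 spam) and the test part holds 1,500 messages (300 spam).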
Conclusion
Best Performing Model: LSTM (Deep Learning) with 96% accuracy.
Best Trade-off Model: Random Forest (95.1% accuracy, less resource-
intensive).
Future improvements could include BERT-based models for better
contextual understanding.
This analysis confirms that machine learning effectively detects spam, with deep
learning providing the highest accuracy at the cost of higher computational requirements.
CHAPTER-8
CONCLUSION
The primary goal of this project was to develop an intelligent, automated system
capable of detecting and classifying email and SMS messages as spam or legitimate (ham)
using machine learning techniques. Through the integration of text preprocessing methods,
feature extraction strategies such as TF-IDF and Bag of Words, and classification algorithms
including Naive Bayes, Random Forest, SVM, and Neural Networks, the system successfully
demonstrated its ability to accurately identify unwanted or harmful messages.
The implementation of Natural Language Processing (NLP) techniques significantly
enhanced the system’s ability to interpret message content, while supervised learning
algorithms enabled precise spam detection based on historical data. Among the evaluated
models, Neural Networks and Support Vector Machines yielded the highest accuracy, though
Naive Bayes offered superior performance in terms of speed and computational efficiency.
This project not only underscores the potential of machine learning in combating spam
across digital communication channels but also highlights the importance of continual model
training and data updates to stay ahead of evolving spam tactics. The system was designed to
be scalable, adaptable, and user-friendly, making it applicable in real-world scenarios ranging
from individual users to enterprise communication platforms.
In conclusion, the machine learning-based spam detection system provides a reliable
solution to a widespread problem, improving user experience, enhancing communication
security, and reducing the risk of phishing and other cyber threats. With further improvements
and integration into live platforms, this solution can be a vital component in maintaining safe
and efficient digital communication.
CHAPTER-9
FUTURE ENHANCEMENT
The current implementation of the Email/SMS Spam Detection system lays a strong
foundation for automated spam classification using machine learning. However, to address
the growing complexity of spam tactics and ensure broader applicability, several future
enhancements can be considered:
1. Advanced Deep Learning Integration
Future versions can incorporate state-of-the-art deep learning models such as LSTM,
GRU, and transformer-based architectures (e.g., BERT) for improved context
understanding and semantic analysis of messages.
2. Real-Time Filtering
Implementing real-time spam detection with minimal latency will allow seamless
integration into live messaging platforms and communication gateways, enhancing user
protection instantly.
3. Multilingual Support
Extending the model to support messages in regional and international languages
will broaden its usability and effectiveness in diverse linguistic environments.
4. Adaptive Learning with User Feedback
Enabling a continuous learning loop where the model updates based on user
feedback will help improve accuracy and reduce misclassification over time.
5. Phishing and Threat Detection
Expanding the system to detect phishing links, scams, and malware-infected
messages will provide users with an additional layer of security.
6. Browser and Mobile App Deployment
Creating lightweight, user-friendly browser extensions or mobile applications will
improve accessibility, allowing end-users to benefit from spam detection on the go.
CHAPTER-10
APPENDIX
App.py
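import pickle
import string

import nltk
import streamlit as st
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Assumes the NLTK tokenizer and stop-word resources are already downloaded.
ps = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))


def transform_text(text):
    """Lowercase, tokenize, drop non-alphanumeric tokens and stop words, then stem."""
    tokens = nltk.word_tokenize(text.lower())
    # Keep alphanumeric tokens only (drops punctuation and symbols).
    tokens = [t for t in tokens if t.isalnum()]
    # Remove English stop words and any remaining punctuation.
    tokens = [t for t in tokens
              if t not in STOP_WORDS and t not in string.punctuation]
    # Reduce each remaining word to its stem.
    return " ".join(ps.stem(t) for t in tokens)


# Load the fitted TF-IDF vectorizer and the trained classifier.
with open('vectorizer.pkl', 'rb') as f:
    tfidf = pickle.load(f)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# input_sms is not defined in the extracted listing; a text area is assumed here.
input_sms = st.text_area("Enter the message")

if st.button('Predict'):
    # 1. preprocess
    transformed_sms = transform_text(input_sms)
    # 2. vectorize
    vector_input = tfidf.transform([transformed_sms])
    # 3. predict
    result = model.predict(vector_input)[0]
    # 4. display
    if result == 1:
        st.header("Spam")
    else:
        st.header("Not Spam")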
10.2 OUTPUT SCREENS
CHAPTER-11
BIBLIOGRAPHY
REFERENCES
6. S. O. Olatunji, "Extreme Learning Machines and Support Vector Machines models for
email spam detection," in IEEE 30th Canadian Conference on Electrical and Computer
Engineering (CCECE), 2017.