Anti Spam

The project 'Email Spam Detection with Machine Learning' aims to develop a robust system for identifying and filtering spam emails using advanced machine learning techniques. The team will analyze spam characteristics, preprocess data, and create user-friendly applications to enhance email security and productivity. The project involves systematic phases from data preparation to model evaluation and deployment, ultimately contributing to improved cybersecurity measures.

NATIONAL RESEARCH UNIVERSITY HIGHER SCHOOL OF ECONOMICS

Graduate School of Business

Course Name: Applied Data Science

Master’s Programme “Business Analytics and Big Data Systems”

Project Name: “Email Spam Detection with Machine Learning (Anti Spam)”

Submitted By:

Project Manager Maxim Egorov

Business Analyst Md Yeashin Arafat

Data Scientist Nishi Kant Chandra

Model Developer Li Shuai

Business Unit Rustom Kobilov


“AntiSpam”
"Email Spam Detection with Machine Learning "

Introduction
Email continues to be one of the most popular means of communication in the digital era, linking
people and businesses worldwide. Spam emails, however, continue to be a problem for this vital
medium. Spam emails are unsolicited communications that frequently have malicious intent,
such as phishing, malware distribution, or the promotion of fraudulent schemes. In addition to
cluttering inboxes, these communications put users' security at serious risk and hinder their
productivity.

As far back as twenty years ago, Bill Gates took a bold stand on this issue by proposing a charge
for sending emails to deter spam. This illustrates how complicated and enduring the problem has been.

Organizations and researchers have turned to cutting-edge technological methods to tackle the
ever-increasing problem of spam. In this field, machine learning (ML) has become a potent tool,
providing reliable techniques for efficiently identifying and filtering spam emails. ML models
can examine trends, spot abnormalities, and accurately categorize emails as spam or legitimate
by utilizing data-driven algorithms.

By creating and deploying an advanced machine learning-based spam detection system, our
project, "Email Spam Detection with Machine Learning," seeks to address this pressing
problem. The project is set up to methodically investigate the issue of email spam, create a
solution utilizing cutting-edge machine learning techniques, and assess its efficacy in practical
applications.
Project part

Relevance
- Key Objectives of the Project
1. Understand the Problem: Delve into the nature and characteristics of spam emails to
comprehend their impact on users and organizations.
2. Leverage Machine Learning: Utilize advanced ML techniques to develop a predictive
model capable of accurately identifying spam emails.
3. Create Practical Applications: Design applications such as a mobile app or an Outlook
extension to integrate the spam detection system into everyday email usage.
4. Deliver Business Value: Provide actionable insights and recommendations to enhance
email security and improve user experiences.

Relevance and Applications

Email spam detection is important for governments and organizations as well as individual
consumers. Efficient spam detection systems can:

● Improve cybersecurity by preventing phishing attacks and malware infections.
● Increase productivity by reducing the time spent dealing with unwanted emails.
● Protect private information from deceptive email tactics.

Approach and Methodology

This project employs a systematic approach combining technical rigor with practical
applications. It begins with a comprehensive understanding of the problem and the relevant data,
followed by data preprocessing and exploratory data analysis. ML models are trained and
validated to achieve high levels of accuracy in spam detection. The project will culminate in
developing user-friendly applications for real-world deployment.
Project Scope and Team Contributions

Our team consists of experts in business analysis, project management, and technical
implementation, each contributing to the success of the project. As part of the initial phase,
Jimmy has undertaken the responsibility of drafting this introduction and describing anti-spam
mechanisms. This section aims to provide a foundation for understanding the project's
importance, scope, and technical underpinnings.

By addressing the challenge of email spam detection through machine learning, this project
aspires to make a meaningful contribution to the field of cybersecurity and digital
communication. The subsequent sections will delve deeper into the technical and business
aspects, laying the groundwork for a comprehensive and impactful solution.

- Team description
Project Manager (Maxim Egorov)

The project manager oversees the project's execution, ensuring that deadlines and deliverables
are met. He also manages resources, schedules meetings, and resolves roadblocks.

Business Analyst (Md Yeashin Arafat)

The business analyst defines project requirements and bridges communication between the
technical and business teams. He is responsible for documenting business needs, validating
project goals, and ensuring alignment with stakeholders.

Data Scientist (Nishi Kant Chandra)

The data scientist prepares and preprocesses data for model training. He cleans the data and
ensures its quality.

Model Developer (Li Shuai)

The model developer develops and refines machine learning models. He ensures that the models
are accurate, efficient, and usable.

Business Unit (Rustom Kobilov)

A business unit is a distinct division within an organization that focuses on a specific set of
products, services, or markets, operating with some level of autonomy. The business unit
representative develops strategies, manages operations, and drives revenue to achieve the unit's
objectives while aligning with the company's overall goals.

- Work plan
Introduction: A work plan for the Antispam project outlines the objective of developing a
system to detect and block spam effectively. It specifies the scope, key tasks such as research,
development, and testing, and assigns responsibilities to team members. The plan includes a
timeline, necessary resources, and deliverables like a functional system and performance reports.

Phase 1: Project initiation and planning
Phase 2: Data preparation and analysis
Phase 3: Model development and validation
Phase 4: Application development and deployment
Phase 5: Results, business recommendations and future enhancements
Business part
Business goals: Create an automated system that allows companies to determine whether
incoming emails/messages are spam or important customer information. This can reduce the cost
of handling unwanted messages and improve customer service efficiency.
Objectives:
1. Build a machine learning model to classify messages into “spam” and “ham”.
2. Evaluate the quality of the data and pre-process it.
3. Train the model on the provided data, ensuring high accuracy and minimizing false positives
and false negatives.
4. Evaluate model metrics (accuracy, recall, F1-score) and recommend a way to integrate
the model into the real system.
Technical part

Methods in general

Methods for Email Spam Detection

To tackle the issue of email spam effectively, our project employs a range of machine learning
techniques and methodologies grounded in data-driven approaches. By leveraging these
methods, we aim to design a robust system capable of accurately identifying and filtering spam
emails. The methods are as follows:

1. Data Preprocessing

Before applying machine learning models, the raw email dataset undergoes preprocessing to
ensure its suitability for analysis and modeling. This step includes:

● Data Cleaning: Removing duplicate entries, addressing missing values, and discarding
irrelevant information.
● Tokenization: Splitting email text into individual tokens (words or phrases) for further
analysis.
● Stopword Removal: Eliminating common but uninformative words (e.g., "the," "and,"
"is") that do not contribute to spam classification.
● Stemming and Lemmatization: Reducing words to their base or root forms to
standardize the text data.
● Label Encoding: Assigning numerical labels to categorical data, such as spam (1) and
non-spam (0).
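
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the short stopword set and the suffix-stripping helper crude_stem are simplified assumptions that stand in for a full stopword list and a real stemmer or lemmatizer.

```python
import re

# Illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "is", "a", "to", "of", "you", "for"}

# Label encoding used in this report: ham -> 0, spam -> 1.
LABELS = {"ham": 0, "spam": 1}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop common words that carry little signal for classification."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


tokens = preprocess("Claim the FREE tickets and prizes waiting for you!")
label = LABELS["spam"]
print(tokens, label)
```

In practice a library stemmer or lemmatizer would replace crude_stem, since naive suffix stripping mishandles many English words.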

2. Feature Extraction

Feature extraction involves converting email text into numerical representations that machine
learning algorithms can process. Common methods include:

● Bag of Words (BoW): Representing email content as a matrix of word counts or
frequencies.
● Term Frequency-Inverse Document Frequency (TF-IDF): Calculating the importance
of words in an email relative to a collection of emails.
● N-grams: Capturing sequences of words (e.g., bigrams, trigrams) to account for
contextual relationships.
● Header and Metadata Analysis: Extracting features from email headers, such as sender
information, subject lines, and timestamps.

3. Machine Learning Models

To classify emails as spam or non-spam, we employ various machine learning algorithms. The
models are selected based on their performance, interpretability, and computational efficiency:

● Logistic Regression: A simple and interpretable algorithm suitable for binary
classification tasks.
● Naïve Bayes: Particularly effective for text data due to its probabilistic approach and
assumption of feature independence.
● Support Vector Machines (SVM): A robust classifier that finds the optimal hyperplane
to separate spam and non-spam emails.
● Random Forest: An ensemble method that combines multiple decision trees to improve
classification accuracy and reduce overfitting.
● Gradient Boosting (e.g., XGBoost, LightGBM): Advanced ensemble techniques that
iteratively refine predictions by minimizing errors.
● Neural Networks: Deep learning models, such as recurrent neural networks (RNNs) or
transformers, for capturing complex patterns in email text.
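
A rough way to try several of these candidates side by side is to wrap each one in the same scikit-learn pipeline. The sketch below is an assumption about how such a comparison could look, on an invented toy corpus; gradient boosting and neural networks are left out to keep it short, and the training accuracy it reports is only a smoke test, not an evaluation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus invented for illustration (1 = spam, 0 = ham).
texts = [
    "win a free prize now",
    "free tickets call now",
    "claim your free cash",
    "txt win to claim prize",
    "see you at lunch",
    "call me when you arrive",
    "meeting moved to noon",
    "thanks for the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "logistic_regression": LogisticRegression(),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

training_accuracy = {}
for name, classifier in candidates.items():
    # Same TF-IDF features for every candidate, so only the model differs.
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(texts, labels)
    training_accuracy[name] = model.score(texts, labels)

print(training_accuracy)
```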

4. Model Evaluation

To assess the performance of the selected models, we use standard evaluation metrics:

● Accuracy: The proportion of correctly classified emails.
● Precision: The percentage of emails classified as spam that are truly spam.
● Recall (Sensitivity): The percentage of actual spam emails correctly identified.
● F1 Score: The harmonic mean of precision and recall, balancing false positives and false
negatives.
● Confusion Matrix: A summary of classification outcomes, including true positives, false
positives, true negatives, and false negatives.
● ROC-AUC Curve: Measuring the model's ability to distinguish between spam and
non-spam emails across various thresholds.
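
The first four metrics follow directly from the confusion-matrix counts. The sketch below computes them in plain Python; the counts passed in at the end are hypothetical and chosen only to illustrate the arithmetic.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 80 spam caught, 10 ham flagged, 900 ham passed, 20 spam missed.
metrics = classification_metrics(tp=80, fp=10, tn=900, fn=20)
print(metrics)
```

With these counts, precision is 80/90 and recall is 80/100, so a model can look strong on accuracy while still missing a fifth of the spam.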

5. Hyperparameter Tuning

To optimize the performance of the models, hyperparameter tuning is conducted using
techniques such as grid search or random search. This step involves adjusting parameters like
learning rates, tree depths, and regularization terms to achieve the best results.
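
As a minimal sketch of grid search, the snippet below tunes a Naive Bayes pipeline with scikit-learn's GridSearchCV. The toy corpus, the parameter grid, the two-fold cross-validation and the use of recall as the scoring metric are all assumptions for illustration, not the settings used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration (1 = spam, 0 = ham).
texts = [
    "win a free prize now",
    "free tickets call now",
    "claim your free cash",
    "txt win to claim prize",
    "see you at lunch",
    "call me when you arrive",
    "meeting moved to noon",
    "thanks for the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Assumed search space: smoothing strength and n-gram range.
param_grid = {
    "classifier__alpha": [0.1, 0.5, 1.0],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="recall")
search.fit(texts, labels)
best_params = search.best_params_
print(best_params, search.best_score_)
```

Tuning the vectorizer and the classifier inside one pipeline keeps the vocabulary from leaking between cross-validation folds.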

6. Deployment and Integration

After selecting the best-performing model, the spam detection system will be integrated into
practical applications, such as:

● Mobile Applications: A user-friendly app for detecting spam emails on smartphones.
● Outlook Extensions: A plugin for Microsoft Outlook to automatically filter spam emails.
● Cloud-Based Services: An API to provide spam detection functionality for third-party
email clients and services.

By systematically implementing these methods, we aim to develop an effective and reliable
solution for email spam detection, contributing to enhanced cybersecurity and user productivity.

Data Cleaning Process and Results

Before performing data cleaning, the dataset contained 5574 rows and 2 columns:

1. type: The label of the message, indicating whether it is a regular message (ham) or spam
(spam).
2. text: The content of the message.

Initial Observations:
● No missing values: Both columns are fully populated, with no empty cells.
● Potential duplicates: The dataset might contain repeated entries, which could lead to biased
model training.
● Imbalance: The dataset likely has an imbalance between the ham and spam categories, as
non-spam messages are generally more common in real-world data.

Here’s a quick overview of the first few rows of the dataset:

Steps in the Data Cleaning Process

Below are the steps performed to prepare the data for further analysis:

1. Removing Duplicates

Duplicate rows can inflate certain classes (like spam) or create noise in the dataset. We removed
them using pandas' .drop_duplicates() method:

● Initial row count: 5574.
● Duplicates removed: 414.
● Final row count: 5160.
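
The duplicate-removal step can be sketched on a small stand-in DataFrame. The rows below are invented; only the column names type and text and the .drop_duplicates() call come from the report.

```python
import pandas as pd

# Small stand-in for the real dataset (rows invented for illustration).
df = pd.DataFrame({
    "type": ["ham", "spam", "ham", "spam", "ham"],
    "text": [
        "See you soon",
        "Win a free prize",
        "See you soon",
        "Win a free prize",
        "Call me later",
    ],
})

rows_before = len(df)
# Drop rows where every column matches an earlier row, then renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
rows_after = len(df)
duplicates_removed = rows_before - rows_after
print(rows_before, duplicates_removed, rows_after)
```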

2. Trimming Whitespace

The text column was checked for leading and trailing whitespace. Even though this might not
seem significant, it can affect text preprocessing (e.g., tokenization).

We used the .str.strip() function to clean up the text.


3. Converting Labels to Numeric

For machine learning purposes, categorical labels need to be converted into numeric format. The
type column was mapped as:

● ham → 0.
● spam → 1.
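
Steps 2 and 3 can be sketched together on a two-row stand-in DataFrame. The rows are invented, and using .map() for the label conversion is an assumption, since the report does not say which pandas call was used.

```python
import pandas as pd

# Two invented rows with stray whitespace around the text.
df = pd.DataFrame({
    "type": ["ham", "spam"],
    "text": ["  See you soon ", "Win a free prize\n"],
})

# Step 2: trim leading and trailing whitespace from every message.
df["text"] = df["text"].str.strip()

# Step 3: convert the categorical labels to numbers (ham -> 0, spam -> 1).
df["type"] = df["type"].map({"ham": 0, "spam": 1})
print(df)
```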

4. Results After Data Cleaning

1. Final Dataset Shape:
○ The cleaned dataset contains 5160 rows and 2 columns.
2. Label Distribution:
○ Non-spam messages (ham): 4518 (approximately 87.5% of the data).
○ Spam messages (spam): 642 (approximately 12.5% of the data).
3. Key Observations:
○ The dataset is imbalanced, with a significantly higher number of non-spam
messages compared to spam. This imbalance should be addressed during model
training using techniques such as oversampling, undersampling, or using
weighted loss functions.
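
To make the imbalance concrete, the sketch below recomputes the class shares from the counts reported above and derives one possible set of class weights. The weighting formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic and is an assumption here; the report does not state which weighting, if any, was applied.

```python
# Label counts reported after data cleaning.
counts = {"ham": 4518, "spam": 642}
total = sum(counts.values())

# Share of each class in the cleaned dataset.
shares = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(label, round(100 * shares[label], 2), round(weights[label], 3))
```

Under this heuristic each spam message counts roughly seven times as much as a ham message in the loss.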

Summary

The data cleaning process ensured that the dataset was free from duplicates, inconsistencies, and
irrelevant whitespace. The labels were standardized for efficient processing, and the distribution
of spam vs. non-spam messages was identified for further consideration during model training.

These steps are crucial to improve the reliability and accuracy of the spam detection model. The
cleaned dataset is now ready for exploratory data analysis (EDA) and model development.
Exploratory Data Analysis (EDA):

Exploratory Data Analysis (EDA) is a crucial step in data analysis that involves exploring,
analyzing, and visualizing data. Below, we describe the EDA process step by step.

A. Import Libraries

Different libraries serve different purposes in data preprocessing. We use NumPy and
pandas for data processing and data wrangling, and Matplotlib and Seaborn for
visualization. For our machine learning techniques, we use scikit-learn (sklearn),
a popular ML library in the Python ecosystem.
B. Reading The Dataset:

Python can read data sets in many formats, such as JSON, CSV, .xlsx, .sql, .txt, and
images. We use a CSV file for our project analysis.
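
Reading a CSV file with pandas can be sketched as follows. The report does not give the file name, so the example reads from an in-memory string instead; with a real file, the path would be passed to pd.read_csv in place of the StringIO object.

```python
import io

import pandas as pd

# Stand-in for the project's CSV file (contents invented for illustration).
csv_text = "type,text\nham,See you at lunch\nspam,Win a free prize now\n"

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```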

C. Understanding the Analysis

Next, we analyze the data set. We first check its basic information and then analyze the data
in more depth.

df.info()
The output shows that the dataset for our spam project has 2 columns and 5,574 entries.

df.head()

The output shows two categories of label, ‘ham’ and ‘spam’, each with an associated email text.
The dataset has 5,574 rows and 2 columns, including 414 duplicate rows that need to be cleaned
later.

Using df.isnull().sum(), we checked for null values and did not find any in our data set.
However, the data set does contain some duplicate values. ‘ham’ emails account for 4827 of the
5574 emails, which means we have a relatively small amount of spam email. The sentence
“Sorry, I'll call later” appears 30 times in our data set. Apart from that, almost all of the
texts are unique.
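
The checks described above can be sketched with a few pandas calls on a stand-in DataFrame. The four rows are invented; on the real dataset the same calls would report the null counts, duplicate rows, label counts and most frequent message.

```python
import pandas as pd

# Small stand-in for the real dataset (rows invented for illustration).
df = pd.DataFrame({
    "type": ["ham", "ham", "spam", "ham"],
    "text": [
        "Sorry, I'll call later",
        "See you soon",
        "Win a free prize",
        "Sorry, I'll call later",
    ],
})

null_counts = df.isnull().sum()                # missing values per column
duplicate_rows = int(df.duplicated().sum())    # fully repeated rows
label_counts = df["type"].value_counts()       # ham vs. spam counts
most_common_text = df["text"].value_counts().idxmax()

print(null_counts, duplicate_rows, label_counts, most_common_text, sep="\n")
```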
D. Data Wrangling

In the data wrangling step, we reshape the data set to fit our workflow.

We rename the two columns of the data set to “Category” and “Message”, and we add a column
named “spam” to flag spam emails.
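
A minimal sketch of this wrangling step is shown below. It assumes the original columns are named type and text, as described in the data cleaning section, and that the new spam column holds 1 for spam and 0 for ham.

```python
import pandas as pd

# Two invented rows using the original column names.
df = pd.DataFrame({
    "type": ["ham", "spam"],
    "text": ["See you soon", "Win a free prize"],
})

# Rename the columns to the names used in the rest of the analysis.
df = df.rename(columns={"type": "Category", "text": "Message"})

# Add a numeric flag: 1 when the category is spam, otherwise 0.
df["spam"] = (df["Category"] == "spam").astype(int)
print(df)
```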
E. Data Visualization

For data visualization, we first used a pie chart to compare the distribution of
spam and ham emails.

From the pie chart, we found 86.60% ham emails and 13.40% spam emails in our data set.
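
The pie-chart percentages can be checked with simple arithmetic from the counts given earlier (4827 ham messages out of 5574). The spam count below is derived by subtraction and is not stated directly in this part of the report.

```python
total_messages = 5574
ham_messages = 4827
spam_messages = total_messages - ham_messages  # derived by subtraction

# Percentage share of each class, rounded to two decimals.
ham_share = round(100 * ham_messages / total_messages, 2)
spam_share = round(100 * spam_messages / total_messages, 2)
print(spam_messages, ham_share, spam_share)
```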
Secondly, we found the most used words in our spam messages through a word cloud plot.

First, we separated the spam emails from the rest of our data set. Then we ran the code
to create the word cloud chart, which is presented below.
From the word cloud plot, we saw that some words are used most often in spam messages,
such as 'free', 'call', 'text', 'txt' and 'now'.
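
A word cloud is a visual form of a word-frequency count. The sketch below shows the underlying count with Python's collections.Counter instead of a word cloud library; the three spam messages are invented for illustration.

```python
import re
from collections import Counter

# Invented spam messages standing in for the spam subset of the dataset.
spam_messages = [
    "Free entry, call now",
    "Text WIN to claim your free prize",
    "Call now for a free txt offer",
]

word_counts = Counter()
for message in spam_messages:
    # Lowercase and keep alphabetic tokens only.
    word_counts.update(re.findall(r"[a-z]+", message.lower()))

top_words = word_counts.most_common(3)
print(top_words)
```

A word cloud library would then scale each word's font size by these counts.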

F. Preparing Data in PowerBI

PowerBI is one of the powerful Business Intelligence tools for data cleaning,
modeling, and data visualization. First, we used a query to set up appropriate data for
creating a simple dashboard.

We renamed our columns to “Categories of email” and “Email Description”. We removed
some rows and columns from our dataset, filtered some rows, and deleted erroneous data.
After transforming the data, we created a simple dashboard based on our dataset.
Methods ML model

The dataset was split into training and testing sets using the train_test_split function from the
sklearn.model_selection module. The Message column containing the email text was used as the
input feature, and the Spam column, indicating whether the email is spam or not, served as the
target label. The data was split with 75% allocated to training and 25% to testing, ensuring robust
evaluation of the machine learning model.
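
The 75/25 split can be sketched with train_test_split as follows. The eight-row DataFrame is invented, and random_state=42 is an assumption added for reproducibility; the report does not say whether a seed or stratification was used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-in with the Message feature column and the spam label column.
df = pd.DataFrame({
    "Message": [
        "win a free prize now",
        "see you at lunch",
        "free tickets call now",
        "call me when you arrive",
        "claim your free cash",
        "meeting moved to noon",
        "txt win to claim prize",
        "thanks for the notes",
    ],
    "spam": [1, 0, 1, 0, 1, 0, 1, 0],
})

# 75% of the rows go to training and 25% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["spam"], test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))
```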

The spam detection system was built using a Multinomial Naive Bayes model, which is
particularly suited for text classification tasks. The model achieved high accuracy and recall,
indicating its suitability for spam detection.

Performance Results:
The Multinomial Naive Bayes model demonstrated high performance. Training Set: ROC-AUC
score of 0.983, with 99.35% accuracy and a weighted F1-score of 99.35%. Testing Set:
ROC-AUC score of 0.957, with 98.49% accuracy and a weighted F1-score of 98.48%. The ROC
curve visualization highlighted the excellent separation between positive (spam) and negative
(ham) classes for both training and testing datasets.
Graph. ROC Curve
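
The training and evaluation flow described above can be sketched end to end on a toy corpus. This is not the project's code: the messages are invented, and the use of CountVectorizer features, a stratified split and a fixed random seed are assumptions. The scores it prints will not match the figures reported for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration (1 = spam, 0 = ham).
spam_texts = [
    "win a free prize now", "free tickets call now",
    "claim your free cash", "txt win to claim prize",
    "urgent call now to win", "free entry text now",
    "cash prize waiting call", "win free tickets today",
]
ham_texts = [
    "see you at lunch", "call me when you arrive",
    "meeting moved to noon", "thanks for the notes",
    "are we still on for dinner", "sorry i will call later",
    "happy birthday to you", "let us talk tomorrow",
]
texts = spam_texts + ham_texts
labels = [1] * len(spam_texts) + [0] * len(ham_texts)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Word counts feed a Multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)


def evaluate(fitted_model, X, y):
    """Return accuracy, weighted F1 and ROC-AUC for one data split."""
    predicted = fitted_model.predict(X)
    spam_probability = fitted_model.predict_proba(X)[:, 1]
    return {
        "accuracy": accuracy_score(y, predicted),
        "weighted_f1": f1_score(y, predicted, average="weighted"),
        "roc_auc": roc_auc_score(y, spam_probability),
    }


train_scores = evaluate(model, X_train, y_train)
test_scores = evaluate(model, X_test, y_test)
print(train_scores)
print(test_scores)
```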

Primary Evaluation Metric:

Recall was chosen as the primary evaluation metric to minimize false negatives, ensuring that
spam emails are correctly identified. This choice aligns with the project’s objective to provide a
robust spam detection system that effectively filters out spam while minimizing the risk of
missed detections.

This structured approach, combining robust data preprocessing, model implementation, and
evaluation, ensures that the spam detection system is both accurate and interpretable, aligning
well with the project's goals.
Project results

Discussion of the Result

After building the machine learning model in Python and preparing the data in PowerBI, we compiled
our project results for spam email detection. We successfully detected spam emails in our data set,
and we also interpreted our analysis in the PowerBI dashboard.

A. In the Python ML model, we successfully identified spam emails for the project result.

In that code, the sample email “Free Tickets for IPL” was detected as spam. The code returns the
result in the form “This is a spam Email”. So, we detect spam email through our ML model.
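
A prediction step of this kind can be sketched as below. The six training messages are invented so that the example runs on its own, the helper name describe_email is hypothetical, and the wording of the non-spam result is an assumption, since the report only shows the spam message.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set (1 = spam, 0 = ham).
texts = [
    "free tickets win now",
    "free prize call now",
    "win free cash",
    "see you at lunch",
    "call me later",
    "meeting at noon",
]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)


def describe_email(fitted_model, email_text):
    """Return a readable verdict for a single email (hypothetical helper)."""
    is_spam = fitted_model.predict([email_text])[0] == 1
    # The non-spam wording is assumed; the report only shows the spam case.
    return "This is a spam Email" if is_spam else "This is a ham Email"


result = describe_email(model, "Free Tickets for IPL")
print(result)
```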

B. In PowerBI, we created a simple dashboard to present our work.

In that dashboard, we used a donut chart and a table to analyze the whole project in a simple
way. In the donut chart, spam emails make up 13.41% of the total, and the number of spam
emails is 750 out of 5574 emails. In the table, we can simply sort the data by “ham”
and “spam”.

Business Recommendation

This project has demonstrated that a machine learning model can successfully detect spam emails. This
detection will help us minimize spam email in our business project. We are going to develop a mobile
application in which emails are checked directly by the machine learning algorithm, which will help
save time and money. Spam email costs people money and time, and our mobile app will let people use
their email safely.
Conclusion

Email communication plays a significant role in the world, so battling spam email is very
important. In our dataset, 13 percent of emails are spam, which is alarming. That is why
detecting spam email is crucial for personal and business life. Our project therefore set out to
develop a robust email spam detector based on a machine learning model. We wanted to equip our
detector to separate “ham” and “spam” emails for our users. During the project, we identified some
words commonly used in spam emails, such as 'free', 'call', 'text', 'txt' and 'now'. Our project
aims to implement spam detection systems that minimize spam in our inboxes and keep them
safe. We look forward to improving the project and innovating further through machine learning
algorithms. So, let’s say, “Save our emails from Spam”.
