Anti Spam

The project 'Email Spam Detection with Machine Learning' aims to develop a robust system for identifying and filtering spam emails using advanced machine learning techniques. The team will analyze spam characteristics, preprocess data, and create user-friendly applications to enhance email security and productivity. The project involves systematic phases from data preparation to model evaluation and deployment, ultimately contributing to improved cybersecurity measures.

Uploaded by

MD Yeashin Arafat

NATIONAL RESEARCH UNIVERSITY HIGHER SCHOOL OF ECONOMICS

Graduate School of Business

Course Name: Applied Data Science

Master’s Programme “Business Analytics and Big Data Systems”

Project Name: “Email Spam Detection with Machine Learning (Anti Spam)”

Submitted By:

Project Manager Maxim Egorov

Business Analyst Md Yeashin Arafat

Data Scientist Nishi Kant Chandra

Model Developer Li Shuai

Business Unit Rustom Kobilov

“AntiSpam”
"Email Spam Detection with Machine Learning"

Introduction
Email remains one of the most popular means of communication in the digital era, linking
people and businesses worldwide. Spam, however, continues to plague this vital medium.
Spam emails are unsolicited messages that frequently carry malicious intent, such as
phishing, malware distribution, or the promotion of fraudulent schemes. Besides cluttering
inboxes, they put users' security at serious risk and reduce productivity.

The problem is long-standing: some twenty years ago, Bill Gates took a bold stand on it by
proposing to charge for emails in order to deter spam. This illustrates how complicated and
enduring the issue has been.

Organizations and researchers have turned to advanced technical methods to tackle the
ever-growing volume of spam. In this field, machine learning (ML) has become a powerful tool,
providing reliable techniques for identifying and filtering spam emails efficiently. Using
data-driven algorithms, ML models can examine trends, spot anomalies, and accurately classify
emails as spam or legitimate.

By creating and deploying an advanced machine learning-based spam detection system, our
project, "Email Spam Detection with Machine Learning," seeks to address this pressing
problem. The project is set up to methodically investigate the issue of email spam, create a
solution utilizing cutting-edge machine learning techniques, and assess its efficacy in practical
applications.
Project part

Relevance
- Key Objectives of the Project
1. Understand the Problem: Delve into the nature and characteristics of spam emails to
comprehend their impact on users and organizations.
2. Leverage Machine Learning: Utilize advanced ML techniques to develop a predictive
model capable of accurately identifying spam emails.
3. Create Practical Applications: Design applications such as a mobile app or an Outlook
extension to integrate the spam detection system into everyday email usage.
4. Deliver Business Value: Provide actionable insights and recommendations to enhance
email security and improve user experiences.

Relevance and Applications

Email spam detection matters to governments and organizations as well as to individual
consumers. Efficient spam detection systems can:

● Improve cybersecurity by preventing phishing attacks and malware infections.
● Increase productivity by reducing the time spent dealing with unwanted emails.
● Protect private information from deceptive email tactics.

Approach and Methodology

This project employs a systematic approach combining technical rigor with practical
applications. It begins with a comprehensive understanding of the problem and the relevant data,
followed by data preprocessing and exploratory data analysis. ML models are trained and
validated to achieve high levels of accuracy in spam detection. The project will culminate in
developing user-friendly applications for real-world deployment.
Project Scope and Team Contributions

Our team consists of experts in business analysis, project management, and technical
implementation, each contributing to the success of the project. As part of the initial phase,
Jimmy has undertaken the responsibility of drafting this introduction and describing anti-spam
mechanisms. This section aims to provide a foundation for understanding the project's
importance, scope, and technical underpinnings.

By addressing the challenge of email spam detection through machine learning, this project
aspires to make a meaningful contribution to the field of cybersecurity and digital
communication. The subsequent sections will delve deeper into the technical and business
aspects, laying the groundwork for a comprehensive and impactful solution.

- Team description
Project Manager (Maxim Egorov)
The project manager oversees the project's execution and ensures that deadlines and
deliverables are met. He also manages resources, schedules meetings, and resolves
roadblocks.

Business Analyst (Md Yeashin Arafat)

The business analyst defines project requirements and bridges communication between the
technical and business teams. He is responsible for documenting business needs, validating
project goals, and ensuring alignment with stakeholders.

Data Scientist (Nishi Kant Chandra)

The data scientist prepares and preprocesses data for model training. He cleans the data
and ensures its quality.

Model Developer (Li Shuai)
The model developer develops and refines machine learning models. He should ensure that
models are accurate, efficient, and usable, whether for machine learning, financial
forecasting, or engineering simulations.

Business Unit (Rustom Kobilov)

A business unit is a distinct division within an organization that focuses on a specific set of
products, services, or markets and operates with some level of autonomy. In this role, he
develops strategies, manages operations, and drives revenue to achieve the unit's objectives
while aligning with the company's overall goals.

- Work plan
Introduction: A work plan for the Antispam project outlines the objective of developing a
system to detect and block spam effectively. It specifies the scope, key tasks such as research,
development, and testing, and assigns responsibilities to team members. The plan includes a
timeline, necessary resources, and deliverables like a functional system and performance reports.

Phase 1: Project initiation and planning
Phase 2: Data preparation and analysis
Phase 3: Model development and validation
Phase 4: Application development and deployment
Phase 5: Results, business recommendations and future enhancements
Business part
Business goals: Create an automated system that allows companies to determine whether
incoming emails or messages are spam or important customer information. This can reduce the
cost of handling unwanted messages and improve customer service efficiency.
Objectives:
1. Build a machine learning model to classify messages as “spam” or “ham”.
2. Evaluate the quality of the data and preprocess it.
3. Train the model on the provided data, aiming for high accuracy while minimizing false
positives and false negatives.
4. Evaluate model metrics (accuracy, recall, F1-score) and recommend a way to integrate
the model into a real system.
Technical part

Methods in general

Methods for Email Spam Detection

To tackle the issue of email spam effectively, our project employs a range of machine learning
techniques and methodologies grounded in data-driven approaches. By leveraging these
methods, we aim to design a robust system capable of accurately identifying and filtering spam
emails. The methods are as follows:

1. Data Preprocessing

Before applying machine learning models, the raw email dataset undergoes preprocessing to
ensure its suitability for analysis and modeling. This step includes:

● Data Cleaning: Removing duplicate entries, addressing missing values, and discarding
irrelevant information.
● Tokenization: Splitting email text into individual tokens (words or phrases) for further
analysis.
● Stopword Removal: Eliminating common but uninformative words (e.g., "the," "and,"
"is") that do not contribute to spam classification.
● Stemming and Lemmatization: Reducing words to their base or root forms to
standardize the text data.
● Label Encoding: Assigning numerical labels to categorical data, such as spam (1) and
non-spam (0).
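
The preprocessing steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stopword set is a tiny hand-picked sample, and `crude_stem` is a toy suffix stripper standing in for a real stemmer or lemmatizer. Neither is taken from the project's code.

```python
import re

# Tiny illustrative stopword list (an assumption; a real project would
# normally use a fuller list from an NLP library).
STOPWORDS = {"the", "and", "is", "a", "to", "you", "for", "of"}

# Label encoding as described in the list above: spam -> 1, ham -> 0.
LABELS = {"ham": 0, "spam": 1}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes.

    Toy stand-in for a real stemmer; it will mis-handle many words.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


tokens = preprocess("You are WINNING free tickets, call now!")
label = LABELS["spam"]
print(tokens, label)
```

The output "winn" for "winning" shows why crude suffix stripping is only a placeholder: a proper stemmer or lemmatizer would map the word to a more useful base form.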

2. Feature Extraction

Feature extraction involves converting email text into numerical representations that machine
learning algorithms can process. Common methods include:

● Bag of Words (BoW): Representing email content as a matrix of word counts or
frequencies.
● Term Frequency-Inverse Document Frequency (TF-IDF): Calculating the importance
of words in an email relative to a collection of emails.
● N-grams: Capturing sequences of words (e.g., bigrams, trigrams) to account for
contextual relationships.
● Header and Metadata Analysis: Extracting features from email headers, such as sender
information, subject lines, and timestamps.
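
To make the Bag of Words, TF-IDF, and n-gram ideas above concrete, here is a minimal sketch in plain Python on three made-up messages. It uses the textbook formula tf × log(N / df); library implementations (for example scikit-learn's `TfidfVectorizer`) apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Three made-up messages used only for illustration.
docs = [
    "free prize call now",
    "call me later",
    "free free tickets",
]
tokenized = [d.split() for d in docs]

# Bag of Words: one Counter of raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def tf_idf(term, doc_index):
    """Textbook TF-IDF: (count / document length) * log(N / df)."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


def bigrams(tokens):
    """Adjacent word pairs, i.e. n-grams with n = 2."""
    return list(zip(tokens, tokens[1:]))


print(bow[2])                      # term counts of the third message
print(round(tf_idf("tickets", 2), 4))
print(bigrams(tokenized[1]))
```

In the third message, "tickets" occurs once but in no other message, while "free" occurs twice but also appears elsewhere, so the rarer word receives the higher TF-IDF weight.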

3. Machine Learning Models

To classify emails as spam or non-spam, we employ various machine learning algorithms. The
models are selected based on their performance, interpretability, and computational efficiency:

● Logistic Regression: A simple and interpretable algorithm suitable for binary
classification tasks.
● Naïve Bayes: Particularly effective for text data due to its probabilistic approach and
assumption of feature independence.
● Support Vector Machines (SVM): A robust classifier that finds the optimal hyperplane
to separate spam and non-spam emails.
● Random Forest: An ensemble method that combines multiple decision trees to improve
classification accuracy and reduce overfitting.
● Gradient Boosting (e.g., XGBoost, LightGBM): Advanced ensemble techniques that
iteratively refine predictions by minimizing errors.
● Neural Networks: Deep learning models, such as recurrent neural networks (RNNs) or
transformers, for capturing complex patterns in email text.

4. Model Evaluation

To assess the performance of the selected models, we use standard evaluation metrics:

● Accuracy: The proportion of correctly classified emails.
● Precision: The percentage of emails classified as spam that are truly spam.
● Recall (Sensitivity): The percentage of actual spam emails correctly identified.
● F1 Score: The harmonic mean of precision and recall, balancing false positives and false
negatives.
● Confusion Matrix: A summary of classification outcomes, including true positives, false
positives, true negatives, and false negatives.
● ROC Curve and AUC: The ROC curve plots the true positive rate against the false positive
rate across classification thresholds; the area under it (ROC-AUC) measures the model's
ability to distinguish between spam and non-spam emails.
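
The first four metrics above follow directly from the four cells of the confusion matrix. The sketch below computes them in plain Python; the counts passed in at the end are hypothetical and are not results from this project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives, with "positive" meaning spam.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=40, fp=5, tn=150, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```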

5. Hyperparameter Tuning

To optimize the performance of the models, hyperparameter tuning is conducted using
techniques such as grid search or random search. This step involves adjusting parameters like
learning rates, tree depths, and regularization terms to achieve the best results.
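
As a hedged illustration of grid search, the sketch below tunes the smoothing parameter `alpha` of a Multinomial Naive Bayes classifier with scikit-learn's `GridSearchCV`. The eight messages, the candidate `alpha` values, the two-fold cross-validation, and the use of recall as the scoring metric are all assumptions made for the example, not the project's actual tuning setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only (1 = spam, 0 = ham).
texts = [
    "free prize call now", "see you at lunch",
    "win free tickets today", "can you call me later",
    "claim your free cash now", "meeting moved to monday",
    "txt win to claim prize", "thanks for the notes",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Vectorizer and classifier in one pipeline so that each CV fold
# builds its vocabulary from its own training split only.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Candidate values for the Naive Bayes smoothing parameter (assumed grid).
param_grid = {"classifier__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=2)
search.fit(texts, labels)

best_alpha = search.best_params_["classifier__alpha"]
print("best alpha:", best_alpha)
```

On a dataset this small the selected value is not meaningful; the point is only the mechanics of defining a grid and letting cross-validation pick among the candidates.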

6. Deployment and Integration

After selecting the best-performing model, the spam detection system will be integrated into
practical applications, such as:

● Mobile Applications: A user-friendly app for detecting spam emails on smartphones.
● Outlook Extensions: A plugin for Microsoft Outlook to automatically filter spam emails.
● Cloud-Based Services: An API to provide spam detection functionality for third-party
email clients and services.

By systematically implementing these methods, we aim to develop an effective and reliable
solution for email spam detection, contributing to enhanced cybersecurity and user productivity.

Data Cleaning Process and Results

Before performing data cleaning, the dataset contained 5574 rows and 2 columns:

1. type: The label of the message, indicating whether it is a regular message (ham) or spam
(spam).
2. text: The content of the message.

Initial Observations:
● No missing values: Both columns are fully populated, with no empty cells.
● Potential duplicates: The dataset might contain repeated entries, which could lead to biased
model training.
● Imbalance: The dataset likely has an imbalance between the ham and spam categories, as
non-spam messages are generally more common in real-world data.

[Table: the first few rows of the dataset]

Steps in the Data Cleaning Process

Below are the steps performed to prepare the data for further analysis:

1. Removing Duplicates

Duplicate rows can inflate certain classes (such as spam) or add noise to the dataset.
Duplicates were removed with the pandas .drop_duplicates() method:

● Initial row count: 5574.
● Duplicates removed: 414.
● Final row count: 5160.

2. Trimming Whitespace

The text column was checked for leading and trailing whitespace. Although such whitespace
may not seem significant, it can affect text preprocessing (e.g., tokenization).

We used the .str.strip() function to clean up the text.

3. Converting Labels to Numeric

For machine learning purposes, categorical labels need to be converted into numeric format. The
type column was mapped as:

● ham → 0.
● spam → 1.
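
Steps 1 to 3 can be sketched with pandas as follows. The five-row DataFrame is made up for illustration; only the column names `type` and `text` and the three method calls come from the description above.

```python
import pandas as pd

# Made-up sample with the same two columns as the project dataset.
df = pd.DataFrame({
    "type": ["ham", "spam", "ham", "spam", "ham"],
    "text": [
        "  See you at 5  ",
        "FREE entry, txt now",
        "  See you at 5  ",
        "FREE entry, txt now",
        "Call me later",
    ],
})

# 1. Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Trim leading and trailing whitespace in the text column.
df["text"] = df["text"].str.strip()

# 3. Convert the labels to numbers: ham -> 0, spam -> 1.
df["type"] = df["type"].map({"ham": 0, "spam": 1})

print(df)
```

The sketch follows the order used in the report. Trimming whitespace before removing duplicates would additionally catch rows that differ only in surrounding spaces.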

4. Results After Data Cleaning

1. Final Dataset Shape:

○ The cleaned dataset contains 5160 rows and 2 columns.
2. Label Distribution:
○ Non-spam messages (ham): 4518 (approximately 87.5% of the data).
○ Spam messages (spam): 642 (approximately 12.5% of the data).
3. Key Observations:
○ The dataset is imbalanced, with a significantly higher number of non-spam
messages compared to spam. This imbalance should be addressed during model
training using techniques such as oversampling, undersampling, or using
weighted loss functions.
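
One common way to derive class weights for a weighted loss is the "balanced" heuristic, weight = n_samples / (n_classes × class_count), which is also what scikit-learn's `class_weight="balanced"` option computes. The sketch below applies it to the label counts reported above; whether the project used this particular scheme is not stated, so treat it as one possible choice.

```python
# Label counts after cleaning, as reported above.
counts = {"ham": 4518, "spam": 642}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts, each spam message weighs roughly seven times as much as a ham message, which offsets the 87.5% to 12.5% imbalance.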

Summary

The data cleaning process ensured that the dataset was free from duplicates, inconsistencies, and
irrelevant whitespace. The labels were standardized for efficient processing, and the distribution
of spam vs. non-spam messages was identified for further consideration during model training.

These steps are crucial to improve the reliability and accuracy of the spam detection model. The
cleaned dataset is now ready for exploratory data analysis (EDA) and model development.
Exploratory Data Analysis (EDA):

Exploratory Data Analysis (EDA) is a crucial step in data analysis that involves exploring,
analyzing, and visualizing the data. Below, we describe the EDA process step by step.

A. Import Libraries

We used different libraries for different purposes. We worked with NumPy and pandas for
data processing and data wrangling, and with Matplotlib and Seaborn for visualization.
For our machine learning techniques, we used scikit-learn (sklearn), a popular ML
library in the Python ecosystem.
[Screenshot: importing the libraries]
B. Reading the Dataset

Python can read datasets in many formats, such as JSON, CSV, .xlsx, SQL, .txt, and
images. We used a CSV file for our analysis.
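
A minimal sketch of loading such a CSV file with pandas is shown here. To keep the example self-contained, the file contents are simulated with an in-memory string; the three rows are made up, and with a real file one would pass its path to `pd.read_csv` instead. The plotting libraries named in section A are left out because this step does not need them.

```python
import io

import pandas as pd

# In-memory stand-in for the project's CSV file (rows are made up).
# With a real file: df = pd.read_csv("path/to/file.csv"), where the path
# is whatever the dataset is saved as. Depending on how the file was
# exported, an explicit encoding argument may be needed.
csv_text = """type,text
ham,"Ok, see you later"
spam,FREE entry in a weekly prize draw
ham,Sorry I'll call later
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first rows of the dataset
```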

C. Understanding the Analysis

We now analyze the dataset. First we check the basic information, then we examine the
data in more depth.

df.info()
From the output, we saw that the dataset has 2 columns and 5,574 entries.

df.head()

The output shows that there are two categories, ‘ham’ and ‘spam’, and that each row pairs a
category with the text of a message.
The dataset has 5,574 rows and 2 columns, and 414 of the rows are duplicates that need to be
cleaned later.

Using df.isnull().sum(), we checked for null values and found none. The dataset does,
however, contain duplicate values. ‘ham’ accounts for 4,827 of the 5,574 messages, which
means that spam makes up only a small share of the data. The sentence “Sorry, I'll call
later” appears 30 times. Apart from such repeats, almost all of the texts are unique.
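
The checks described in this step can be reproduced on a small made-up DataFrame as follows. The five rows are illustrative only; on the project dataset the same calls would produce the figures quoted above.

```python
import pandas as pd

# Made-up sample; the real dataset has 5,574 rows.
df = pd.DataFrame({
    "type": ["ham", "ham", "spam", "ham", "spam"],
    "text": [
        "Sorry, I'll call later",
        "Sorry, I'll call later",
        "FREE entry, txt now",
        "See you at 5",
        "Call now to claim",
    ],
})

df.info()                                    # column names, dtypes, entry count

missing_per_column = df.isnull().sum()       # null values per column
n_duplicates = int(df.duplicated().sum())    # fully repeated rows
class_counts = df["type"].value_counts()     # ham vs spam counts
summary = df.describe(include="all")         # count, unique, top, freq

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(class_counts)
print(summary)
```

The `top` and `freq` rows of the summary give the most frequent value in each column and how often it occurs, which is how a repeated message such as the one mentioned above shows up.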
D. Data Wrangling

In the data wrangling step, we reshaped the dataset to suit our workflow.

We renamed the two columns to “Category” and “Message” and added one column named
“Spam” to flag spam emails.
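
A possible pandas version of this wrangling step is sketched below. It assumes the original column names `type` and `text` from the data cleaning section and a 0/1 encoding for the new column; the exact code used in the project is not shown in the report.

```python
import pandas as pd

# Made-up sample using the original column names.
df = pd.DataFrame({
    "type": ["ham", "spam", "ham"],
    "text": ["See you at 5", "FREE entry, txt now", "Call me later"],
})

# Rename the columns to the names used in the rest of the analysis.
df = df.rename(columns={"type": "Category", "text": "Message"})

# Add a numeric flag: 1 for spam, 0 for ham (assumed encoding).
df["Spam"] = (df["Category"] == "spam").astype(int)

print(df)
```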
E. Data Visualization

First, we used a pie chart to compare the distribution of spam and ham emails.

The pie chart shows that 86.60% of the emails in our dataset are ham and 13.40% are spam.
Second, we found the most frequently used words in spam messages with a word cloud
plot.

We first separated the spam emails from the rest of the dataset and then generated the
word cloud chart from them.
[Figure: word cloud of spam messages]
The word cloud shows that some words are used especially often in spam messages,
such as 'free', 'call', 'text', 'txt' and 'now'.
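
A word cloud is a visual rendering of word frequencies, so the same information can be obtained by counting words directly. The sketch below does this with `collections.Counter` instead of a word cloud library; the four messages are made up, and the simple regular-expression tokenizer is an assumption.

```python
import re
from collections import Counter

# Made-up (label, text) pairs for illustration.
messages = [
    ("spam", "FREE entry! Text WIN to claim now"),
    ("spam", "Call now for a free prize"),
    ("ham", "Can you call me after lunch?"),
    ("spam", "Txt STOP to end, call free"),
]

# Keep only the spam messages and join them into one text.
spam_text = " ".join(text for label, text in messages if label == "spam")

# Lowercase and split into words, then count occurrences.
words = re.findall(r"[a-z']+", spam_text.lower())
word_counts = Counter(words)

top_words = word_counts.most_common(3)
print(top_words)
```

Filtering out stopwords such as "to" and "for" before counting would bring the result closer to what a word cloud typically displays.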

F. Preparing Data in Power BI

Power BI is a powerful business intelligence tool for data cleaning, modeling, and data
visualization. We first used a query to prepare the data for a simple dashboard.

We renamed our columns to “Categories of email” and “Email Descriptio”. We also removed
some rows and columns from the dataset, filtered some rows, and deleted rows containing
errors. After transforming the data, we created a simple dashboard based on the dataset.
Methods ML model

The dataset was split into training and testing sets using the train_test_split function from the
sklearn.model_selection module. The Message column containing the email text was used as the
input feature, and the Spam column, indicating whether the email is spam or not, served as the
target label. The data was split with 75% allocated to training and 25% to testing, ensuring robust
evaluation of the machine learning model.

The spam detection system was built using a Multinomial Naive Bayes model, which is
particularly suited for text classification tasks. The model achieved high accuracy and recall,
indicating its suitability for spam detection.
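
The split and the model described above can be sketched with scikit-learn as follows. The twelve messages are made up, and two details are assumptions not stated in the report: the text is turned into word counts with `CountVectorizer`, and `random_state=42` is fixed only to make the split reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the Message and Spam columns (1 = spam, 0 = ham).
messages = [
    "free prize call now", "see you at lunch",
    "win free tickets today", "can you call me later",
    "claim your free cash now", "meeting moved to monday",
    "txt win to claim prize", "thanks for the notes",
    "free entry in weekly draw", "are you coming home",
    "urgent call now to win", "happy birthday to you",
]
spam = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# 75% training, 25% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    messages, spam, test_size=0.25, random_state=42
)

# Word counts feed the Multinomial Naive Bayes classifier.
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"train size: {len(X_train)}, test size: {len(X_test)}")
print(f"test accuracy: {test_accuracy:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the vocabulary is learned from the training messages only, so no information from the test set leaks into training.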

Performance Results:
The Multinomial Naive Bayes model demonstrated high performance:
● Training Set: ROC-AUC score of 0.983, with 99.35% accuracy and a weighted F1-score of
99.35%.
● Testing Set: ROC-AUC score of 0.957, with 98.49% accuracy and a weighted F1-score of
98.48%.
The ROC curve visualization highlighted the excellent separation between positive (spam) and
negative (ham) classes for both training and testing datasets.
Graph. ROC Curve

Primary Evaluation Metric:

Recall was chosen as the primary evaluation metric to minimize false negatives, ensuring that
spam emails are correctly identified. This choice aligns with the project’s objective to provide a
robust spam detection system that effectively filters out spam while minimizing the risk of
missed detections.
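
Prioritizing recall has a cost that is worth keeping in view: lowering the decision threshold catches more spam but also flags more legitimate email. The plain-Python sketch below shows this trade-off on hypothetical classifier scores; neither the scores nor the two thresholds come from the project.

```python
# Hypothetical (spam probability, true label) pairs; 1 = spam, 0 = ham.
scored = [
    (0.95, 1), (0.80, 1), (0.60, 0), (0.45, 1),
    (0.30, 0), (0.20, 1), (0.05, 0),
]


def precision_recall_at(threshold):
    """Classify as spam when probability >= threshold, then score."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


strict = precision_recall_at(0.7)    # fewer emails flagged as spam
lenient = precision_recall_at(0.4)   # more emails flagged as spam

print("threshold 0.7 -> precision %.2f, recall %.2f" % strict)
print("threshold 0.4 -> precision %.2f, recall %.2f" % lenient)
```

In this toy example, moving the threshold from 0.7 to 0.4 raises recall from 0.50 to 0.75 while precision falls from 1.00 to 0.75, which is why recall is usually reported alongside precision or F1.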

This structured approach, combining robust data preprocessing, model implementation, and
evaluation, ensures that the spam detection system is both accurate and interpretable, aligning
well with the project's goals.
Project results

Discussion of the Result

After building the machine learning model in Python and preparing the data in Power BI, we
obtained our project results for spam email detection. We successfully detected spam emails in
our dataset and also presented our analysis in a Power BI dashboard.

A. With the Python ML model, we successfully identified spam emails.

In our test, the sample email “Free Tickets for IPL” was detected as spam: the code returned
the result “This is a spam Email”. Our ML model can therefore detect spam emails.

B. In Power BI, we created a simple dashboard to present our work.

In the dashboard, we used a donut chart and a table to summarize the whole project in a simple
way. In the donut chart, spam accounts for 13.41% of all emails, that is, 750 of the 5574
emails. In the table, the data can easily be sorted by “ham” and “spam”.

Business Recommendation

This project has demonstrated that a machine learning model can successfully detect spam emails. Such
detection will help us minimize spam email in our business. We plan to develop a mobile application in
which emails are checked directly by the machine learning algorithm, which will save time and money.
Spam costs people both money and time; our mobile app will allow people to use their email safely.
Conclusion

Email communication plays a significant role in the world, so battling spam is very important. In our
dataset, about 13 percent of emails are spam, which is alarming. That is why detecting spam is crucial in
both personal and business life. Our project therefore set out to develop a robust email spam detector
based on a machine learning model, equipped to separate “ham” from “spam” emails for our users. During
the project, we identified some words commonly used in spam emails, such as 'free', 'call', 'text', 'txt'
and 'now'. Our project aims to implement spam detection systems that minimize spam in our inboxes
and keep them safe. We look forward to improving the project and innovating further with machine
learning algorithms. So, let's say: “Save our emails from Spam”.