Spam Detection Model

This document outlines the implementation of a spam detection model using the Naive Bayes classifier, specifically the MultinomialNB variant from scikit-learn. It details the data preprocessing steps, including importing libraries, vectorizing text data, splitting the dataset, training the model, and evaluating its performance. The model also includes a user interface for predicting whether a message is spam and highlighting spam-indicative words.

Uploaded by

githouse36

Madda Walabu University

College of Computing
Department of Computer Science
3rd year second semester 2024

AI GROUP ASSIGNMENT

BY :
NAMUSA HASSAN UGR/22318/13
BEKAM UGR/
Spam Detection Model
Spam Detection Using Naive Bayes Classifier

1. Introduction

Spam detection is an essential application in the domain of Natural Language Processing (NLP).
The goal is to classify email messages as either "spam" or "ham" (non-spam). In this document,
we discuss the implementation of a spam detection model using the Naive Bayes classifier and
describe the data preprocessing steps undertaken.

2. Model Used: Naive Bayes Classifier

The model used in this project is the Naive Bayes classifier, specifically the MultinomialNB
variant from the scikit-learn library. MultinomialNB works on discrete features such as word
counts, which makes it well-suited to text classification tasks like spam detection. Naive Bayes
applies Bayes' Theorem under the assumption that, given the class, the presence (or absence) of a
particular feature is independent of the presence (or absence) of any other feature.
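To make the independence assumption concrete, the following is a minimal sketch of the calculation done by hand. The word counts, the equal class priors, and the helper names log_likelihood and spam_posterior are all assumptions chosen for illustration; they are not taken from the dataset or from scikit-learn.

```python
import math

# Hypothetical word counts per class (illustrative numbers, not from the dataset)
spam_counts = {"free": 30, "win": 20, "meeting": 1}
ham_counts = {"free": 5, "win": 2, "meeting": 40}
prior_spam, prior_ham = 0.5, 0.5  # assumed equal class priors


def log_likelihood(words, counts, alpha=1.0):
    """Sum of log P(word | class) using add-alpha (Laplace) smoothing."""
    vocab_size = len(counts)
    total = sum(counts.values())
    return sum(
        math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab_size))
        for w in words
    )


def spam_posterior(words):
    """P(spam | words), treating words as independent given the class."""
    log_spam = math.log(prior_spam) + log_likelihood(words, spam_counts)
    log_ham = math.log(prior_ham) + log_likelihood(words, ham_counts)
    # Subtract the maximum before exponentiating to avoid underflow
    m = max(log_spam, log_ham)
    p_spam = math.exp(log_spam - m)
    p_ham = math.exp(log_ham - m)
    return p_spam / (p_spam + p_ham)


print(spam_posterior(["free", "win"]))
print(spam_posterior(["meeting"]))
```

Because each word contributes its own factor, the per-class score is just a sum of log-probabilities, which is why the model is cheap to train and to apply.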

3. Data Preprocessing Steps

Step 1: Importing Libraries and Loading the Dataset

To start, we imported essential libraries for data handling, text processing, model building, and
evaluation. The dataset containing emails and their labels (spam or ham) was then loaded into a
DataFrame for further processing.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

dataset = pd.read_csv('emails.csv')

Step 2: Vectorizing Text Data

The raw text data from the emails needed to be converted into a numerical format that the Naive
Bayes model could process. This was achieved using a technique called vectorization.
The CountVectorizer was used to transform the text into a matrix of token counts, where each
column represents a unique word, and each row corresponds to an email.

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(dataset['text'])
y = dataset['spam']

Step 3: Splitting the Dataset

The dataset was split into two parts: a training set and a testing set. The training set is used to
train the model, while the testing set is used to evaluate the model's performance. An 80-20 split
was chosen, meaning 80% of the data was used for training and 20% was reserved for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
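Spam datasets are often imbalanced, so a purely random split can leave the test set with a different spam ratio than the training set. The sketch below shows one possible variation that passes the stratify argument of train_test_split. The labels and the placeholder feature column are made up for illustration, and stratification is an addition that the original split does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 spam (1) and 80 ham (0)
labels = np.array([1] * 20 + [0] * 80)
features = np.arange(100).reshape(-1, 1)  # placeholder feature column

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both parts keep the same 20% spam share as the full label array
print(len(l_train), len(l_test))
print(l_train.mean(), l_test.mean())
```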

Step 4: Training the Naive Bayes Model

With the text data vectorized and the dataset split, the Naive Bayes model was trained on the
training set. This involves fitting the model to the training data, allowing it to learn the patterns
and characteristics of spam and ham emails.

model = MultinomialNB()
model.fit(X_train, y_train)
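In the steps above, the vectorizer is fitted on the whole dataset before the split, so words that occur only in test emails still shape the feature columns. One possible alternative, sketched here on made-up messages, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that the vocabulary is learned from the training texts only. The toy messages and the spam_pipeline name are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages; 1 = spam, 0 = ham
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "meeting moved to monday",
    "please review the attached report",
]
train_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and the class statistics from the training texts only
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, train_labels)

# predict() accepts raw strings; the pipeline vectorizes them internally
toy_predictions = spam_pipeline.predict(
    ["free prize offer", "monday report meeting"]
)
print(toy_predictions)
```

A pipeline also keeps the vectorizer and model together as a single object, which makes it harder to apply a model to features built with a different vocabulary.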

Step 5: Evaluating the Model

After training, the model's performance was evaluated using the testing set. This was done by
predicting the labels of the test data and comparing them to the actual labels. The accuracy score
was calculated to measure how well the model could correctly classify emails as spam or ham.

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)
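Accuracy alone can hide how the errors are distributed, especially when ham messages outnumber spam. As a hedged complement, the sketch below computes a confusion matrix, precision, and recall with scikit-learn. The label lists are made up for illustration and are not results from this model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration; 1 = spam, 0 = ham
true_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(true_labels, predicted_labels)

# Precision: share of predicted spam that really is spam
precision = precision_score(true_labels, predicted_labels)
# Recall: share of real spam that the model caught
recall = recall_score(true_labels, predicted_labels)

print(cm)
print("Precision:", precision)
print("Recall:", recall)
```

With the real model, the same calls would take y_test and y_pred in place of the made-up lists.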

Step 6: Predicting and Highlighting Spam Words

To enhance the model's usability and interpretability, a function was defined to predict whether a
given message is spam and to highlight key words that indicate spam. This function takes a
message as input, predicts its spam probability, and identifies words that are strongly associated
with spam.

def predictMessage(message):
    # Vectorize the message with the vocabulary learned during training
    message_vector = vectorizer.transform([message])
    prediction = model.predict(message_vector)
    # Column 1 holds the probability of the spam class (label 1)
    spam_probability = model.predict_proba(message_vector)[0][1]
    feature_names = vectorizer.get_feature_names_out().tolist()
    log_probabilities = model.feature_log_prob_
    spam_weights = log_probabilities[1]
    message_words = message.split()
    # feature_log_prob_ holds log P(word | class); these values are usually
    # far below -1, so lower the threshold if no words are highlighted
    spam_words = [
        word for word in message_words
        if word in feature_names and spam_weights[feature_names.index(word)] > -1
    ]
    result = "Spam" if prediction[0] == 1 else "Ham"
    return {
        "result": result,
        "spam_probability": spam_probability,
        "spam_words": spam_words
    }

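A fixed threshold on the spam log-probability favours words that are simply frequent. As one possible alternative, the sketch below highlights a word when it is more likely under the spam class than under the ham class, using the difference of the two rows of feature_log_prob_. The function name spam_indicative_words, the min_log_ratio parameter, and the toy training messages are assumptions for illustration. The sketch also tokenizes with the vectorizer's own analyzer so that capitalized or punctuated words still match the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def spam_indicative_words(message, vectorizer, model, min_log_ratio=0.0):
    """Return tokens whose log P(w | spam) - log P(w | ham) exceeds the cutoff."""
    classes = list(model.classes_)
    spam_idx = classes.index(1)  # assumes spam is labelled 1
    ham_idx = classes.index(0)   # assumes ham is labelled 0
    log_ratio = model.feature_log_prob_[spam_idx] - model.feature_log_prob_[ham_idx]
    analyzer = vectorizer.build_analyzer()  # same lowercasing and tokenizing as training
    vocabulary = vectorizer.vocabulary_     # dict mapping token -> column index
    return [
        token for token in analyzer(message)
        if token in vocabulary and log_ratio[vocabulary[token]] > min_log_ratio
    ]


# Hypothetical toy training data; 1 = spam, 0 = ham
demo_texts = [
    "win a free prize now",
    "free cash offer click now",
    "meeting moved to monday",
    "please review the attached report",
]
demo_labels = [1, 1, 0, 0]

demo_vectorizer = CountVectorizer()
demo_model = MultinomialNB().fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

highlighted = spam_indicative_words(
    "FREE prize at the meeting!", demo_vectorizer, demo_model
)
print(highlighted)
```

Raising min_log_ratio above zero keeps only words that lean more strongly towards spam.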
User Interface:

The model is designed to interact with users by predicting the nature of the entered message
(spam or ham) and highlighting spam-indicative words.

userMessage = input('Enter text to predict: ')
prediction = predictMessage(userMessage)
print(f"The message is: {prediction['result']}")
print(f"Spam Probability: {prediction['spam_probability']}")
print(f"Spam Words Highlighted: {prediction['spam_words']}")

4. Conclusion

The Naive Bayes classifier proves to be an effective and straightforward method for spam
detection in text data. The preprocessing steps, including vectorization and data splitting, are
crucial in transforming the raw text into a format suitable for model training and evaluation. The
inclusion of a function to predict and highlight spam words enhances the interpretability and
usability of the model in practical applications.
