
LAB Manual

PART A
(PART A: TO BE REFERRED BY STUDENTS)

Experiment No.05
A.1 Aim:
Implementation of the Naïve Bayes algorithm using any programming
language, such as Java, C++ or C#.

A.2 Prerequisite:
Familiarity with the programming languages

A.3 Outcome:
After successful completion of this experiment, students will be able to
use classification and clustering algorithms of data mining.

A.4 Theory:


Introduction to Bayesian Classification


Bayesian classification represents a supervised learning method as
well as a statistical method for classification. It assumes an underlying
probabilistic model, which allows us to capture uncertainty about the
model in a principled way by determining the probabilities of the outcomes,
and it can solve both diagnostic and predictive problems. The classification
is named after Thomas Bayes (1702-1761), who proposed Bayes' theorem.
Bayesian classification provides practical learning algorithms in which prior
knowledge and observed data can be combined, and it offers a useful
perspective for understanding and evaluating many learning algorithms.
It calculates explicit probabilities for hypotheses and is robust to noise
in input data. Uses of Naive Bayes classification:
1. Naive Bayes text classification (http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html): Bayesian classification is used as a probabilistic learning
method for text, and Naive Bayes classifiers are among the most successful known
algorithms for learning to classify text documents (a minimal code sketch appears after
this list).
2. Spam filtering (http://en.wikipedia.org/wiki/Bayesian_spam_filtering)


Spam filtering is the best-known use of Naive Bayesian text classification.
It makes use of a naive Bayes classifier to identify spam e-mail, and
Bayesian spam filtering has become a popular mechanism for distinguishing
illegitimate spam email from legitimate email (sometimes called "ham").
Many modern mail clients implement Bayesian spam filtering, and users can
also install separate email filtering programs. Server-side email filters,
such as DSPAM, SpamAssassin, SpamBayes, Bogofilter and ASSP, make
use of Bayesian spam filtering techniques, and the functionality is
sometimes embedded within the mail server software itself.
3. Hybrid Recommender System Using Naive Bayes Classifier and
Collaborative Filtering (http://eprints.ecs.soton.ac.uk/18483/)
Recommender systems apply machine learning and data mining
techniques to filter unseen information and can predict whether a user
would like a given resource.
The cited work proposes a switching hybrid recommendation approach
that combines a Naive Bayes classifier with collaborative filtering.
Experimental results on two different data sets show that the
proposed algorithm is scalable and provides better performance, in terms
of accuracy and coverage, than other algorithms, while at the same time
eliminating some recorded problems with recommender systems.

4. Online applications (http://www.convo.co.uk/x02/). This online
application has been set up as a simple example of supervised machine
learning and affective computing. Using a training set of examples that
reflect nice, nasty or neutral sentiments, Ditto is trained to distinguish
between them. Its simple emotion modelling combines a statistically based
classifier with a dynamical model. The Naive Bayes classifier employs
single words and word pairs as features, and it allocates user utterances into
nice, nasty and neutral classes, labelled +1, -1 and 0 respectively. This
numerical output drives a simple first-order dynamical system, whose
state represents the simulated emotional state of the experiment's
personification, Ditto the donkey.
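
As a minimal illustration of the text-classification and spam-filtering uses above, the following sketch trains a multinomial Naive Bayes classifier on a handful of invented short messages (the corpus, labels and test message are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: 1 = spam, 0 = ham
messages = [
    "win a free prize now",
    "limited offer, claim your free gift",
    "meeting rescheduled to Monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Bag-of-words features: each message becomes a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Multinomial Naive Bayes is suited to word-count features
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new, unseen message
print(clf.predict(vectorizer.transform(["claim your free prize"])))
```

With word counts as features, the classifier simply multiplies per-class word probabilities with the class priors, exactly as in the theory below.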

Naïve Bayesian Classification


Naïve Bayes classification is based on Bayes' theorem. It is particularly
suited to problems where the dimensionality of the inputs is high.
Parameter estimation for naive Bayes models uses the method of maximum
likelihood. In spite of its over-simplified independence assumptions, it
often performs well in many complex real-world situations.
Advantage: it requires only a small amount of training data to estimate the
parameters.
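
To make the maximum-likelihood estimation concrete, the short sketch below computes the per-class sample mean and variance of a single feature, which is what a Gaussian Naive Bayes model stores for each feature (the numbers are invented for illustration):

```python
import numpy as np

# Invented values of one feature, grouped by class
x_yes = np.array([2.0, 2.5, 3.0, 3.5])
x_no = np.array([5.0, 5.5, 6.0])

# Maximum-likelihood estimates: sample mean and variance per class
for name, x in [("yes", x_yes), ("no", x_no)]:
    print(f"class {name}: mean = {x.mean():.3f}, variance = {x.var():.3f}")
```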

Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) × P(H) / P(X)

Informally, this can be viewed as

posterior = likelihood × prior / evidence

The classifier predicts that X belongs to class Ci if and only if the
probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
Practical difficulty: it requires initial knowledge of many
probabilities, involving significant computational cost.
Given:

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age      income  student  credit_rating  buys_computer

<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

P(Ci): P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357

Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X|buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019

P(X|Ci) × P(Ci):
P(X|buys_computer = "yes") × P(buys_computer = "yes") = 0.044 × 0.643 = 0.028
P(X|buys_computer = "no") × P(buys_computer = "no") = 0.019 × 0.357 = 0.007

Therefore, X belongs to class buys_computer = "yes".
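
The hand computation above can be verified with a few lines of Python; the counts come directly from the table (a minimal sketch using the same numbers):

```python
# Priors from the table: 9 "yes" and 5 "no" out of 14 records
p_yes, p_no = 9 / 14, 5 / 14

# Class-conditional likelihoods for
# X = (age <= 30, income = medium, student = yes, credit_rating = fair)
like_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)  # ~ 0.044
like_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)   # ~ 0.019

# Unnormalised posteriors; the larger one determines the class
print("yes:", like_yes * p_yes)  # ~ 0.028
print("no: ", like_no * p_no)    # ~ 0.007
```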

PART B
(PART B: TO BE COMPLETED BY STUDENTS)

(Students must submit the soft copy as per the following segments
within two hours of the practical. The soft copy must be uploaded
to Blackboard or emailed to the concerned lab in-charge faculty
at the end of the practical in case there is no Blackboard access
available.)

B.1 Software Code written by student:


(Paste the software code written for your case study during the
two hours of practical in the lab here.)
# Import necessary modules
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Load the data
d1 = pd.read_csv('diabetes_csv.csv')
d1.head()

# Create feature and target arrays (all columns except the last are features)
X = d1[d1.columns[:-1]]
y = d1[d1.columns[-1]]

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a Gaussian Naive Bayes model on the training set
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on data the model has not seen before
y_pred = gnb.predict(X_test)
print(y_pred)

print("Gaussian Naive Bayes model accuracy (in %):",
      metrics.accuracy_score(y_test, y_pred) * 100)
B.2 Input and Output:
(Paste screenshots of the input dataset and the Python and Weka outputs
related to your case study in the following format.)

Jupyter Notebook:
Sample Dataset:
Python Output:
Weka Output:
Naive Bayes model accuracy (in %):
76.62337662337663 (using Python)
76.3021 (using Weka)
B.3 Observations and learning:
(Students are expected to comment on the output obtained with clear
observations and learning for each task/ sub part assigned)
We observed that Bayesian classification represents a supervised learning
method as well as a statistical classification method. It is used as a
probabilistic learning method, and Naive Bayes classifiers are among the
most successful known algorithms for learning to classify datasets. The Naïve
Bayes classification algorithm estimates class probabilities from the training
data and classifies new records according to those learned probabilities.
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual
outcome listed above and learning/observation noted in section B.3)
The Naïve Bayes classification algorithm is a probabilistic algorithm that
classifies records according to the probabilities learned from training data.
Hence, we have successfully implemented the Naive Bayes algorithm using
both Python and the Weka tool.
B.5 Question of Curiosity
(To be answered by student based on the practical performed and
learning/observations)
Q1: How many instances and attributes are in the data set?
The number of instances is the total number of rows (observations) in the
dataset, and the number of attributes is the total number of columns (features).
If the dataset is in a specific format (a CSV file, Excel sheet, etc.), its
dimensions can be checked directly using data analysis tools or programming
languages such as Python or R, as shown in the sketch below.
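For example, with pandas (a minimal sketch; the filename is a placeholder):
```python
import pandas as pd

# Load the dataset (placeholder filename)
df = pd.read_csv('your_dataset.csv')

# shape returns (rows, columns), i.e. (instances, attributes)
rows, cols = df.shape
print(f"{rows} instances, {cols} attributes")
```
For the diabetes dataset loaded in B.1, `d1.shape` reports its dimensions directly.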
Q2: What are the minimum, maximum and mean values of the attributes?
To find the minimum, maximum, and mean values of specific attributes in a dataset,
you'll typically follow these steps using data analysis tools or programming languages.
Here’s how you can do it in Python using pandas:
1. **Load the Dataset**: First, import the dataset into a pandas DataFrame.
2. **Check the Attributes**: Identify the attributes you're interested in.
3. **Calculate Statistics**: Use pandas functions to calculate the minimum, maximum,
and mean.
Here’s a sample code snippet:
```python
import pandas as pd
# Load the dataset (replace 'your_dataset.csv' with your actual file)
df = pd.read_csv('your_dataset.csv')
# For a specific attribute (replace 'attribute_name' with your actual attribute name)
attribute_name = 'your_attribute_name'
min_value = df[attribute_name].min()
max_value = df[attribute_name].max()
mean_value = df[attribute_name].mean()
print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
print(f"Mean: {mean_value}")
```
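Alternatively, `df.describe()` summarises every numeric attribute at once, building on the DataFrame `df` loaded in the snippet above:
```python
# count, mean, std, min, quartiles and max for all numeric columns
print(df.describe())
```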
Q3: What is the accuracy of the classifier?
For this experiment, the Gaussian Naive Bayes classifier achieved an accuracy
of about 76.62% in Python and 76.30% in Weka (see B.2). In general, the
accuracy of a classifier is determined as follows:
1. **Split the Dataset**: Divide your dataset into training and testing sets.
2. **Train the Classifier**: Use the training set to train your classification model.
3. **Make Predictions**: Use the trained model to make predictions on the testing set.
4. **Calculate Accuracy**: Compare the predicted labels to the actual labels and calculate
the accuracy.
Here's how you might implement this in Python using scikit-learn:
```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # or any other classifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the dataset
df = pd.read_csv('your_dataset.csv')

# Define your features (X) and target variable (y)
# (replace 'target_column' with your actual target column)
X = df.drop('target_column', axis=1)
y = df['target_column']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the classifier
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
### Notes:
- Replace `'your_dataset.csv'` and `'target_column'` with your dataset's actual filename
and target column name.
- You can choose a different classifier based on your needs (e.g., SVM, Decision Tree,
etc.).
- The `test_size` parameter in `train_test_split` determines the proportion of the dataset
to include in the test split (0.2 means 20% for testing).
