DWM Exp5 C49
PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.05
A.1 Aim:
Implementation of Naïve Bayes Algorithm using any programming
language like JAVA,C++,C#
A.2 Prerequisite:
Familiarity with the programming languages
A.3 Outcome:
After successful completion of this experiment students will be able
to
Use classification and clustering algorithms of data mining.
A.4 Theory:
P(H | X) = P(X | H) × P(H) / P(X)
Informally, this can be viewed as
posterior = likelihood × prior / evidence
The classifier predicts that X belongs to class Ci if and only if the probability
P(Ci|X) is the highest among all P(Ck|X) for the k classes.
Practical difficulty: it requires initial knowledge of many probabilities,
which can involve significant computational cost.
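The rule above can be illustrated with a toy numeric example (the probabilities here are assumed for illustration, not taken from the experiment's dataset):

```python
# Toy Bayes-rule computation with assumed numbers (not from the dataset):
p_h = 0.3          # prior P(H)
p_x_given_h = 0.8  # likelihood P(X | H)
p_x = 0.5          # evidence P(X)

# posterior = likelihood * prior / evidence
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 2))  # 0.48
```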
Given:
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
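The classification of X above can be sketched by hand in Python. The class-conditional counts used below are an assumption taken from the standard 14-tuple AllElectronics training set that usually accompanies this example; substitute your own counts if your dataset differs:

```python
# Hand computation of the Naive Bayes decision for X, assuming the standard
# 14-tuple AllElectronics training data (9 tuples with buys_computer = yes,
# 5 with buys_computer = no).
p_yes, p_no = 9 / 14, 5 / 14  # class priors

# Likelihoods P(x_k | Ci) estimated as per-class relative frequencies:
# age<=30, income=medium, student=yes, credit_rating=fair
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)
likelihood_no = (3/5) * (2/5) * (1/5) * (2/5)

score_yes = likelihood_yes * p_yes  # approx. 0.028
score_no = likelihood_no * p_no     # approx. 0.007

prediction = 'yes' if score_yes > score_no else 'no'
print(prediction)  # X is classified as buys_computer = 'yes'
```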
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
Jupyter Notebook:
Sample Dataset:
Python Output:
Weka Tool
Weka Output:
Naive Bayes model accuracy (in %):
76.62337662337663 (Using Python)
76.3021 (Using Weka)
B.3 Observations and learning:
(Students are expected to comment on the output obtained with clear
observations and learning for each task/ sub part assigned)
We observed that Bayesian classification is both a supervised learning method
and a statistical classification method. It is a probabilistic learning approach,
and Naive Bayes classifiers are among the most successful known algorithms for
learning to classify datasets. The Naïve Bayes classification algorithm estimates
probabilities from its training data and classifies new tuples according to that
learned knowledge.
B.4 Conclusion:
(Students must write the conclusion as per the attainment of individual
outcome listed above and learning/observation noted in section B.3)
The Naïve Bayes classification algorithm is a probabilistic algorithm that
classifies datasets according to probabilities learned from training data and
produces results based on that knowledge.
Hence we have successfully implemented the Naive Bayes algorithm using both
Python and the Weka tool.
B.5 Question of Curiosity
(To be answered by student based on the practical performed and
learning/observations)
Q1: How many instances and attributes are in the data set?
To determine the number of instances and attributes in a dataset, you'll typically need to
look at the data itself. The number of instances refers to the total number of rows (or
observations), while the number of attributes refers to the total number of columns (or
features) in the dataset.
If you have the dataset in a specific format (such as a CSV file or Excel
sheet), you can check its dimensions directly using data analysis tools or
programming languages like Python or R.
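As a sketch of the check described above, pandas exposes both counts through `DataFrame.shape`. The tiny DataFrame here is a stand-in with assumed values; in practice you would load the experiment's dataset with `pd.read_csv(...)`:

```python
import pandas as pd

# Toy data standing in for the experiment's dataset (assumed values);
# in practice: df = pd.read_csv('your_dataset.csv')
df = pd.DataFrame({
    'age': [25, 35, 45],
    'income': ['medium', 'high', 'low'],
    'buys_computer': ['yes', 'yes', 'no'],
})

# shape returns (rows, columns) = (instances, attributes)
n_instances, n_attributes = df.shape
print(f"Instances: {n_instances}, Attributes: {n_attributes}")
```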
Q2: What is the minimum, maximum and mean values of that attributes?
To find the minimum, maximum, and mean values of specific attributes in a dataset,
you'll typically follow these steps using data analysis tools or programming languages.
Here’s how you can do it in Python using pandas:
1. **Load the Dataset**: First, import the dataset into a pandas DataFrame.
2. **Check the Attributes**: Identify the attributes you're interested in.
3. **Calculate Statistics**: Use pandas functions to calculate the minimum, maximum,
and mean.
Here’s a sample code snippet:
```python
import pandas as pd
# Load the dataset (replace 'your_dataset.csv' with your actual file)
df = pd.read_csv('your_dataset.csv')
# For a specific attribute (replace 'attribute_name' with your actual attribute name)
attribute_name = 'your_attribute_name'
min_value = df[attribute_name].min()
max_value = df[attribute_name].max()
mean_value = df[attribute_name].mean()
print(f"Minimum: {min_value}")
print(f"Maximum: {max_value}")
print(f"Mean: {mean_value}")
```
Q3: What is the accuracy of the classifier?
To determine the accuracy of a classifier, you'll typically follow these steps:
1. **Split the Dataset**: Divide your dataset into training and testing sets.
2. **Train the Classifier**: Use the training set to train your classification model.
3. **Make Predictions**: Use the trained model to make predictions on the testing set.
4. **Calculate Accuracy**: Compare the predicted labels to the actual labels and calculate
accuracy.
Here's how you might implement this in Python using scikit-learn:
```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB  # Naive Bayes, matching this experiment
from sklearn.metrics import accuracy_score
import pandas as pd
# Load the dataset
df = pd.read_csv('your_dataset.csv')
# Define your features (X) and target variable (y)
X = df.drop('target_column', axis=1)  # replace 'target_column' with your actual target column
y = df['target_column']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the classifier
classifier = GaussianNB()  # or any classifier you prefer
classifier.fit(X_train, y_train)
# Make predictions
y_pred = classifier.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
### Notes:
- Replace `'your_dataset.csv'` and `'target_column'` with your dataset's actual filename
and target column name.
- You can choose different classifiers based on your needs (e.g., SVM, Decision Tree,
etc.).
- The `test_size` parameter in `train_test_split` determines the proportion of the dataset
to include in the test split (0.2 means 20% for testing).