Mini Project Sushant 612210154


Step 1: Data Acquisition and Understanding

1. Load required Libraries and Dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Dry_Bean_Dataset.csv')
df.head()

2. Perform initial Exploratory Data Analysis (EDA) to understand basic statistics:

df.info()
df.describe()

3. Check for missing values:

df.isnull().sum()

Step 2: Data Preprocessing and Transformation

1. Handling Missing Values: If the dataset has missing values, we can handle them by
imputing the mean for numerical columns or using forward-fill for categorical columns.
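As a minimal sketch of the two strategies just described, the snippet below fills a small made-up frame. The values and the name `toy` are hypothetical and are not taken from the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bean data; values are made up.
toy = pd.DataFrame({
    "Area": [100.0, np.nan, 300.0, 200.0],
    "Class": ["SEKER", None, "DERMASON", "DERMASON"],
})

# Numerical columns: replace missing entries with the column mean.
# The mean ignores NaN, so here it is (100 + 300 + 200) / 3 = 200.
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical columns: carry the previous valid value forward.
cat_cols = toy.columns.difference(num_cols)
toy[cat_cols] = toy[cat_cols].ffill()

print(toy)
```

In a real pipeline it is usually safer to compute the fill values on the training split only, and to drop rows whose target label is missing rather than forward-filling it.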
2. Splitting the dataset using train_test_split:

X = df.drop(columns=['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

3. Feature scaling on data:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
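To illustrate what the scaling step does, here is a small sketch on synthetic data. The arrays `train` and `test` and the name `demo_scaler` are assumptions made for the example. The scaler learns the mean and standard deviation from the training rows only and reuses them for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (3 features each).
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std from train
test_scaled = demo_scaler.transform(test)        # reuses the train statistics

# After fitting, each training column has mean ~0 and standard deviation ~1.
print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))

# The test columns are only approximately centred, because they are
# standardised with the training mean and standard deviation.
print(test_scaled.mean(axis=0).round(3))
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step.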

Step 3: Data Visualization

1. Correlation Matrix to see relationships between features:

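import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
# 'Class' is a text column, so restrict the correlation to numeric features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()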

2. Distribution Plots of key features to understand the spread of values:

sns.displot(df['Area'])
sns.displot(df['MajorAxisLength'])

3. Pair Plots to visualize relationships between input features:

# 'Class' must be among the selected columns for hue='Class' to work
sns.pairplot(df[['Area', 'Perimeter', 'MajorAxisLength',
                 'MinorAxisLength', 'AspectRation', 'Class']], hue='Class')

Step 4: Model Building

Implement multiple models for comparison:

1. Logistic Regression:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

2. Support Vector Machine (SVM):

from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)

3. K-Nearest Neighbors (KNN):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

Step 5: Model Evaluation

1. Accuracy:

Logistic Regression:

from sklearn.metrics import accuracy_score

y_pred_logreg = logreg.predict(X_test)
print("Accuracy for Logistic Regression:",
      accuracy_score(y_test, y_pred_logreg))

Support Vector Machine (SVM):

from sklearn.metrics import accuracy_score

y_pred_svm = svm.predict(X_test)
print("Accuracy for SVM:", accuracy_score(y_test, y_pred_svm))

K-Nearest Neighbors (KNN):

from sklearn.metrics import accuracy_score

y_pred_knn = knn.predict(X_test)
print("Accuracy for KNN:", accuracy_score(y_test, y_pred_knn))
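Accuracy alone can hide weak classes. As a hedged sketch, the `classification_report` function imported in Step 1 can break the score down per class. The label lists below are invented for illustration and are not model output from this project.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted labels (not results from this project).
y_true_demo = ["SEKER", "SEKER", "DERMASON", "DERMASON", "SIRA", "SIRA"]
y_pred_demo = ["SEKER", "SEKER", "DERMASON", "SIRA", "SIRA", "SIRA"]

# Overall accuracy: 5 of the 6 labels match.
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print("Accuracy:", demo_accuracy)

# Per-class precision, recall and F1; output_dict=True returns a dict
# instead of a formatted string, which is easier to inspect in code.
demo_report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
print("DERMASON recall:", demo_report["DERMASON"]["recall"])
print(classification_report(y_true_demo, y_pred_demo))
```

The same call could be applied to `y_test` and any of the prediction arrays above, for example `classification_report(y_test, y_pred_svm)`.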

2. Compare accuracy across models to select the best one. For instance:

o Logistic Regression: 92.28% accuracy
o SVM: 93.29% accuracy
o K-Nearest Neighbors (KNN): 92.16% accuracy
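The comparison can also be done in one loop. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the bean CSV, and `max_iter=1000` is an added setting to help logistic regression converge, so its scores will not match the figures listed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the bean features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=3, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale with statistics learned from the training split only.
demo_scaler = StandardScaler()
X_tr = demo_scaler.fit_transform(X_tr)
X_te = demo_scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

# Fit each model and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.2%}")

# Pick the model with the highest test accuracy.
best_name = max(scores, key=scores.get)
print("Best model:", best_name)
```

With differences of about one percentage point between models, a single train/test split may not be decisive; cross-validation would give a steadier comparison.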

Conclusion and Insights

The best-performing model is the Support Vector Machine (SVM), with an accuracy of 93.29%.
The most important features contributing to the prediction of Dry Beans are Area,
MajorAxisLength, and Perimeter.
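The steps above do not show how feature importance was measured. One possible way to estimate it, sketched here on synthetic data, is the impurity-based importance of the `RandomForestClassifier` imported in Step 1. The feature names and data are hypothetical, and this is not necessarily the method behind the ranking stated above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names and synthetic data (not the bean dataset).
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

On the real data, the same pattern would use `X.columns` as the index, with the forest fitted on the training split.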
