AttiqAhmadAfsarLab11


Name: Attiq Ahmad Afsar

Registration number: B21F0206SE048

Exploratory Data Analysis (EDA)

- Load the dataset.
- Identify numerical and categorical variables.
- Handle missing values.
- Visualize data distributions and relationships.

Feature Engineering

- Encode categorical variables.
- Scale the numerical features.

Model Training & Evaluation

- Train K-NN and Gaussian Naive Bayes models.
- Evaluate performance using metrics like ROC-AUC and cross-validation.

Full Code with Explanations:

# Step 1: Load the necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix

# Step 2: Load the dataset


data = pd.read_csv('adult.csv')

# Step 3: Exploratory Data Analysis (EDA)


# Display the first few rows of the dataset
print(data.head())

# Identify categorical and numerical columns


categorical_cols = data.select_dtypes(include=['object']).columns
numerical_cols = data.select_dtypes(exclude=['object']).columns

# Check for missing values


print(data.isnull().sum())

# Visualize the distribution of numerical columns


for col in numerical_cols:
    plt.figure(figsize=(10, 6))
    sns.histplot(data[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()
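
The outline above also calls for visualizing relationships, not just distributions. One hedged option (an added sketch, not part of the original notebook) is a correlation heatmap over the numerical columns:

# Added sketch: relationship view over the numerical features
plt.figure(figsize=(10, 8))
sns.heatmap(data[numerical_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between numerical features')
plt.show()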

# Step 4: Feature Engineering


# Handling missing values (if any)
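
The Adult census CSV commonly encodes missing entries as the string '?'. A minimal sketch, assuming that convention holds for this file, converts them to NaN and drops the affected rows:

# Assumption: missing entries are encoded as '?' in this CSV
data = data.replace('?', np.nan)
data = data.dropna()
print(data.isnull().sum())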

# Encoding categorical variables using Label Encoding

encoder = LabelEncoder()
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

# Split data into features (X) and target variable (y)


X = data.drop('target_column', axis=1) # Replace 'target_column' with the actual target column name
y = data['target_column'] # Replace 'target_column' with the actual target column name
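
For reference, in the standard UCI Adult census data the label column is usually named 'income' (values '<=50K' / '>50K'), so under that assumption the split would read:

# Hypothetical, assuming the UCI naming convention for this CSV
X = data.drop('income', axis=1)
y = data['income']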

# Scale the numerical features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 5: Train-Test Split


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Step 6: Model Training and Evaluation

# Train K-Nearest Neighbors (K-NN) model


knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test)

# Evaluate K-NN model performance


print(f'K-NN Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}')
# ROC-AUC should be computed on predicted probabilities, not hard class labels
roc_auc_knn = roc_auc_score(y_test, knn_model.predict_proba(X_test)[:, 1])
print(f'K-NN ROC-AUC: {roc_auc_knn:.4f}')
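
The choice of n_neighbors=5 is just a starting point. A quick hedged sweep (an added sketch, not from the original lab) shows how held-out accuracy varies with k:

# Added sketch: compare a few k values on the test set
for k in [1, 3, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f'k={k}: accuracy={knn_k.score(X_test, y_test):.4f}')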

# Confusion Matrix for K-NN


cm_knn = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for K-NN')
plt.show()

# Train Gaussian Naive Bayes (GNB) model


gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

# Make predictions
y_pred_gnb = gnb_model.predict(X_test)

# Evaluate GNB model performance


print(f'GNB Accuracy: {accuracy_score(y_test, y_pred_gnb):.4f}')
# Again, use predicted probabilities for ROC-AUC
roc_auc_gnb = roc_auc_score(y_test, gnb_model.predict_proba(X_test)[:, 1])
print(f'GNB ROC-AUC: {roc_auc_gnb:.4f}')

# Confusion Matrix for GNB


cm_gnb = confusion_matrix(y_test, y_pred_gnb)
sns.heatmap(cm_gnb, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for GNB')
plt.show()

# Step 7: ROC Curve Comparison


fpr_knn, tpr_knn, _ = roc_curve(y_test, knn_model.predict_proba(X_test)[:, 1])
fpr_gnb, tpr_gnb, _ = roc_curve(y_test, gnb_model.predict_proba(X_test)[:, 1])

plt.figure(figsize=(10, 6))
plt.plot(fpr_knn, tpr_knn, color='blue', label=f'K-NN ROC (AUC = {roc_auc_knn:.4f})')
plt.plot(fpr_gnb, tpr_gnb, color='red', label=f'GNB ROC (AUC = {roc_auc_gnb:.4f})')
plt.plot([0, 1], [0, 1], color='black', linestyle='--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.show()

# Step 8: Cross-validation (for both models)


cv_knn = cross_val_score(knn_model, X_scaled, y, cv=5, scoring='accuracy')
cv_gnb = cross_val_score(gnb_model, X_scaled, y, cv=5, scoring='accuracy')

print(f'K-NN Cross-validation scores: {cv_knn}')


print(f'GNB Cross-validation scores: {cv_gnb}')

# Compute the average cross-validation score for each model


print(f'Average K-NN Cross-validation score: {cv_knn.mean():.4f}')
print(f'Average GNB Cross-validation score: {cv_gnb.mean():.4f}')
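
One caveat with this setup: X_scaled was fit on the full dataset before splitting, so each cross-validation fold sees scaling statistics computed partly from its own test portion. A hedged refinement (not in the original notebook) is to wrap the scaler and model in a scikit-learn Pipeline so scaling is refit inside each fold:

# Added sketch: fold-safe scaling via a Pipeline
from sklearn.pipeline import make_pipeline

knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_knn_safe = cross_val_score(knn_pipe, X, y, cv=5, scoring='accuracy')
print(f'Average K-NN Pipeline CV score: {cv_knn_safe.mean():.4f}')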

Explanation of the Code:

Loading the Dataset: pd.read_csv() loads the dataset into a Pandas DataFrame.

Exploratory Data Analysis (EDA): select_dtypes() separates the categorical columns from the numerical ones, isnull().sum() reports the missing values per column, and sns.histplot() shows the distribution of each numerical feature.
