AttiqAhmadAfsarLab11


Name: Attiq Ahmad Afsar

Registration number: B21F0206SE048

Exploratory Data Analysis (EDA)

- Load the dataset.
- Identify numerical and categorical variables.
- Handle missing values.
- Visualize data distributions and relationships.

Feature Engineering

- Encode categorical variables.
- Scale the numerical features.

Model Training & Evaluation

- Train K-NN and Gaussian Naive Bayes models.
- Evaluate performance using metrics like ROC-AUC and cross-validation.

Full Code with Explanations:

# Step 1: Load the necessary libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve, confusion_matrix

# Step 2: Load the dataset


data = pd.read_csv('adult.csv')

# Step 3: Exploratory Data Analysis (EDA)


# Display the first few rows of the dataset
print(data.head())

# Identify categorical and numerical columns


categorical_cols = data.select_dtypes(include=['object']).columns
numerical_cols = data.select_dtypes(exclude=['object']).columns

# Check for missing values


print(data.isnull().sum())

# Visualize the distribution of numerical columns


for col in numerical_cols:
    plt.figure(figsize=(10, 6))
    sns.histplot(data[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()
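
The outline above also calls for visualizing relationships, not just distributions. One hedged option (an added sketch, not part of the original notebook) is a correlation heatmap over the numerical columns:

# Added sketch: relationship view over the numerical features
plt.figure(figsize=(10, 8))
sns.heatmap(data[numerical_cols].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between numerical features')
plt.show()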

# Step 4: Feature Engineering


# Handling missing values (if any)
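
The Adult census CSV commonly encodes missing entries as the string '?'. A minimal sketch, assuming that convention holds for this file, converts them to NaN and drops the affected rows:

# Assumption: missing entries are encoded as '?' in this CSV
data = data.replace('?', np.nan)
data = data.dropna()
print(data.isnull().sum())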

# Encoding categorical variables using Label Encoding

encoder = LabelEncoder()
for col in categorical_cols:
    data[col] = encoder.fit_transform(data[col])

# Split data into features (X) and target variable (y)


X = data.drop('target_column', axis=1) # Replace 'target_column' with the actual target column name
y = data['target_column'] # Replace 'target_column' with the actual target column name
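
For reference, in the standard UCI Adult census data the label column is usually named 'income' (values '<=50K' / '>50K'), so under that assumption the split would read:

# Hypothetical, assuming the UCI naming convention for this CSV
X = data.drop('income', axis=1)
y = data['income']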

# Scale the numerical features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 5: Train-Test Split


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Step 6: Model Training and Evaluation

# Train K-Nearest Neighbors (K-NN) model


knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn_model.predict(X_test)

# Evaluate K-NN model performance


print(f'K-NN Accuracy: {accuracy_score(y_test, y_pred_knn):.4f}')
# ROC-AUC should be computed on predicted probabilities, not hard class labels
roc_auc_knn = roc_auc_score(y_test, knn_model.predict_proba(X_test)[:, 1])
print(f'K-NN ROC-AUC: {roc_auc_knn:.4f}')
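
The choice of n_neighbors=5 is just a starting point. A quick hedged sweep (an added sketch, not from the original lab) shows how held-out accuracy varies with k:

# Added sketch: compare a few k values on the test set
for k in [1, 3, 5, 7, 9, 11]:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f'k={k}: accuracy={knn_k.score(X_test, y_test):.4f}')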

# Confusion Matrix for K-NN


cm_knn = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for K-NN')
plt.show()

# Train Gaussian Naive Bayes (GNB) model


gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

# Make predictions
y_pred_gnb = gnb_model.predict(X_test)

# Evaluate GNB model performance


print(f'GNB Accuracy: {accuracy_score(y_test, y_pred_gnb):.4f}')
# Again, use predicted probabilities for ROC-AUC
roc_auc_gnb = roc_auc_score(y_test, gnb_model.predict_proba(X_test)[:, 1])
print(f'GNB ROC-AUC: {roc_auc_gnb:.4f}')

# Confusion Matrix for GNB


cm_gnb = confusion_matrix(y_test, y_pred_gnb)
sns.heatmap(cm_gnb, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix for GNB')
plt.show()

# Step 7: ROC Curve Comparison


fpr_knn, tpr_knn, _ = roc_curve(y_test, knn_model.predict_proba(X_test)[:, 1])
fpr_gnb, tpr_gnb, _ = roc_curve(y_test, gnb_model.predict_proba(X_test)[:, 1])

plt.figure(figsize=(10, 6))
plt.plot(fpr_knn, tpr_knn, color='blue', label=f'K-NN ROC (AUC = {roc_auc_knn:.4f})')
plt.plot(fpr_gnb, tpr_gnb, color='red', label=f'GNB ROC (AUC = {roc_auc_gnb:.4f})')
plt.plot([0, 1], [0, 1], color='black', linestyle='--')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve Comparison')
plt.legend(loc='lower right')
plt.show()

# Step 8: Cross-validation (for both models)


cv_knn = cross_val_score(knn_model, X_scaled, y, cv=5, scoring='accuracy')
cv_gnb = cross_val_score(gnb_model, X_scaled, y, cv=5, scoring='accuracy')

print(f'K-NN Cross-validation scores: {cv_knn}')


print(f'GNB Cross-validation scores: {cv_gnb}')

# Compute the average cross-validation score for each model


print(f'Average K-NN Cross-validation score: {cv_knn.mean():.4f}')
print(f'Average GNB Cross-validation score: {cv_gnb.mean():.4f}')
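
One caveat with this setup: X_scaled was fit on the full dataset before splitting, so each cross-validation fold sees scaling statistics computed partly from its own test portion. A hedged refinement (not in the original notebook) is to wrap the scaler and model in a scikit-learn Pipeline so scaling is refit inside each fold:

# Added sketch: fold-safe scaling via a Pipeline
from sklearn.pipeline import make_pipeline

knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_knn_safe = cross_val_score(knn_pipe, X, y, cv=5, scoring='accuracy')
print(f'Average K-NN Pipeline CV score: {cv_knn_safe.mean():.4f}')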

Explanation of the Code:

Loading the Dataset: pd.read_csv() loads the dataset into a Pandas DataFrame.

Exploratory Data Analysis (EDA): select_dtypes() separates the categorical columns from the numerical ones, isnull().sum() reports the missing values per column, and sns.histplot() shows the distribution of each numerical feature.
