0% found this document useful (0 votes)
5 views

Assignment

This document presents an assignment on the application of the K-Nearest Neighbors (KNN) algorithm for breast cancer diagnosis, detailing the entire workflow from data acquisition to model evaluation. It includes sections on dataset exploration, preprocessing, methodology, and model performance metrics, highlighting the effectiveness of KNN in clinical decision support. The report concludes with suggestions for future work to enhance model accuracy and diagnostic capabilities.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

Assignment

This document presents an assignment on the application of the K-Nearest Neighbors (KNN) algorithm for breast cancer diagnosis, detailing the entire workflow from data acquisition to model evaluation. It includes sections on dataset exploration, preprocessing, methodology, and model performance metrics, highlighting the effectiveness of KNN in clinical decision support. The report concludes with suggestions for future work to enhance model accuracy and diagnostic capabilities.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Atish Dipankar University of Science &

Technology
Department of Computer Science and Engineering

Assignment
on KNN using Breast Cancer Dataset

K-Nearest Neighbors for Breast Cancer Diagnosis

Student: Kholipha Ahmmad Al-Amin


ID: 221-0217-203

Course: Microprocessor & Assembly Languages


CSE-317/318 [Section-01]

Teacher: Prof. Mahmudur Rahman Roni


Coordinator, Department of CSE

March 18, 2025


KNN on Breast Cancer Dataset ID: 221-0217-203

Contents
1 Introduction 2

2 Dataset and Exploratory Data Analysis 2

3 Data Preprocessing 3

4 Methodology 3
4.1 K-Nearest Neighbors (KNN) Classifier . . . . . . . . . . . . . . . . . . . 3
4.2 Enhanced Machine Learning Pipeline . . . . . . . . . . . . . . . . . . . . 3
4.3 Feature Importance and Analysis . . . . . . . . . . . . . . . . . . . . . . 4

5 Model Training and Evaluation 4


5.1 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
5.2 Expanded Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 4
5.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

6 Python Code Overview 5

7 Conclusion and Future Work 7

Appendix 7

1
KNN on Breast Cancer Dataset ID: 221-0217-203

Abstract
Abstract
This comprehensive ultra pro legendary report investigates the application
of the K-Nearest Neighbors (KNN) algorithm for breast cancer diagnosis.
The document details an end-to-end workflowfrom data acquisition and
preprocessing to model training, evaluation, and error analysisaugmented
with state-of-the-art visualizations and analytical insights. This framework
not only demonstrates the potential of KNN in clinical decision support but
also sets a benchmark for future diagnostic research.

1 Introduction
Breast cancer continues to be a significant global health concern. With the advent of
machine learning, innovative approacheslike the KNN classifierhave emerged as pivotal tools
for early detection and treatment planning. This report provides a detailed walkthrough
of constructing a KNN-based diagnostic model, covering all essential phases such as
data exploration, preprocessing, model implementation, evaluation, and error analysis.
Additionally, the report enriches the discussion with feature importance analysis and
insights into model limitations, ensuring a well-rounded investigation.

2 Dataset and Exploratory Data Analysis


The dataset includes numerous features representing tumor characteristics along with
diagnosis labels. Key initial insights include:

• A robust, well-organized dataset with significant statistical properties.

• Elimination of redundant columns (e.g., Unnamed: 32) ensuring data clarity.

• A clear understanding of feature distributions achieved through detailed visualizations.

Figure 1 presents an exemplary summary of the Exploratory Data Analysis (EDA),


showcasing the distribution of vital variables and statistical insights.

2
KNN on Breast Cancer Dataset ID: 221-0217-203

Figure 1: Exploratory Data Analysis Summary

3 Data Preprocessing
Data preprocessing is crucial to prepare the dataset for modeling. The steps include:

1. Data Cleaning: Removing extraneous columns and correcting inconsistencies.

2. Missing Value Imputation: Employing median imputation to handle any missing data.

3. Feature Scaling: Utilizing StandardScaler for normalization, ensuring that all features
contribute equally.

4. Data Splitting: Dividing the dataset into 70% for training and 30% for testing to
validate model performance.

4 Methodology
4.1 K-Nearest Neighbors (KNN) Classifier
The KNN algorithm classifies instances based on their proximity to training examples in
feature space. In this study:

• The hyperparameter k is set to 3, as determined by preliminary experimentation.

• The Euclidean distance metric is applied to measure similarity between instances.

4.2 Enhanced Machine Learning Pipeline


Figure 2 illustrates the enhanced machine learning pipeline with a detailed TikZ diagram.
The diagram highlights each step, providing a clear roadmap from data acquisition to
model evaluation.

3
KNN on Breast Cancer Dataset ID: 221-0217-203

Data Acquisition Data Cleaning Imputation Feature Scaling Train-Test Split

Model Evaluation KNN Training

Figure 2: Enhanced Machine Learning Pipeline for Breast Cancer Diagnosis

4.3 Feature Importance and Analysis


Understanding which features contribute most to model predictions can guide further
improvements. Although KNN does not inherently provide feature importance scores, the
following methods can be employed:

• Correlation Analysis: Evaluating the Pearson correlation coefficient between each


feature and the diagnosis.

• Recursive Feature Elimination (RFE): Utilizing RFE with alternative classifiers to


rank feature importance.

• Visualization: Creating heatmaps and pair plots to visually assess feature interactions.

5 Model Training and Evaluation


5.1 Training Process
Post-preprocessing, the KNN classifier was trained using the designated training set. The
evaluation phase included:

• Accuracy Score: Providing an overall measure of prediction accuracy.

• Confusion Matrix: Visualizing the true versus predicted labels.

• Classification Report: Detailing precision, recall, and F1-score for each diagnosis
category.

5.2 Expanded Evaluation Metrics


In addition to traditional metrics, further evaluation is provided through:

• Receiver Operating Characteristic (ROC) Curve: Illustrating the diagnostic ability


of the model.

• Area Under the Curve (AUC): Quantifying the model’s overall performance.

5.3 Results and Discussion


The classifier achieved an accuracy of approximately XX% (please update with actual
value). Figure 3 displays the confusion matrix, and additional ROC analysis is discussed
in the subsequent section. The model’s performance demonstrates strong predictive power,

4
KNN on Breast Cancer Dataset ID: 221-0217-203

while the error analysis highlights potential areas for hyperparameter tuning and further
feature engineering.

Figure 3: Confusion Matrix of the KNN Classifier

6 Python Code Overview


For transparency and reproducibility, the complete Python code used for this study is
provided below. The code encompasses all stagesfrom data loading and preprocessing to
model training and evaluation.
1 import pandas as pd
2 import numpy as np
3 import matplotlib . pyplot as plt
4 import seaborn as sns
5 from sklearn . model_selection import train_test_split
6 from sklearn . preprocessing import StandardScaler
7 from sklearn . neighbors import K N e i g h b o r s C l a s s i f i e r
8 from sklearn . metrics import accuracy_score , confusion_matrix ,
classification_report , roc_curve , auc
9

10 df = pd . read_csv ( ’/ content /2025 -03 -12 B r e a s t _ C a n c e r _ D i a g n o s t i c . csv


’ , index_col =0)
11 if ’ Unnamed : 32 ’ in df . columns :
12 df = df . drop ( ’ Unnamed : 32 ’ , axis =1)
13 print ( " Dataset Head : " )
14 print ( df . head () )
15

16 target = df [ ’ diagnosis ’]
17 features = df . drop ( ’ diagnosis ’ , axis =1)
18

5
KNN on Breast Cancer Dataset ID: 221-0217-203

19 if features . isnull () . sum () . sum () > 0:


20 print ( " Missing values detected . Imputing using median values .
")
21 features = features . fillna ( features . median () )
22

23 scaler = StandardScaler ()
24 scaled_features = scaler . fit_transform ( features )
25 df_scaled = pd . DataFrame ( scaled_features , columns = features .
columns )
26 print ( " \ nScaled Features Head : " )
27 print ( df_scaled . head () )
28

29 plt . figure ( figsize =(12 , 8) )


30 plt . subplot (2 , 2 , 1)
31 sns . histplot ( df [ " radius_mean " ] , kde = True , bins =20 , color = " blue " )
32 plt . title ( " Distribution of Radius Mean " )
33 plt . subplot (2 , 2 , 2)
34 sns . histplot ( df [ " texture_mean " ] , kde = True , bins =20 , color = " green "
)
35 plt . title ( " Distribution of Texture Mean " )
36 plt . subplot (2 , 2 , 3)
37 sns . boxplot ( x = df [ " diagnosis " ] , y = df [ " radius_mean " ] , palette = "
coolwarm " )
38 plt . title ( " Radius Mean by Diagnosis " )
39 plt . subplot (2 , 2 , 4)
40 sns . boxplot ( x = df [ " diagnosis " ] , y = df [ " texture_mean " ] , palette = "
coolwarm " )
41 plt . title ( " Texture Mean by Diagnosis " )
42 plt . tight_layout ()
43 plt . savefig ( " eda_summary . png " , dpi =1200)
44 plt . show ()
45

46 X_train , X_test , y_train , y_test = train_test_split (


47 scaled_features , target , test_size =0.30 , random_state =42
48 )
49

50 knn = KNe i g h b o r s C l a s s i f i e r ( n_neighbors =3)


51 knn . fit ( X_train , y_train )
52 y_pred = knn . predict ( X_test )
53

54 accuracy = accuracy_score ( y_test , y_pred )


55 conf_mat = confusion_matrix ( y_test , y_pred )
56 report = c l a s s i f i c a t i o n _ r e p o r t ( y_test , y_pred )
57 print ( " \ nAccuracy : " , accuracy )
58 print ( " \ nConfusion Matrix :\ n " , conf_mat )
59 print ( " \ nClassification Report :\ n " , report )
60

61 fpr , tpr , thresholds = roc_curve ( y_test . map ({ ’B ’:0 , ’M ’ :1}) , knn .


predict_proba ( X_test ) [: ,1])
62 roc_auc = auc ( fpr , tpr )
63 print ( " \ nAUC : " , roc_auc )

6
KNN on Breast Cancer Dataset ID: 221-0217-203

64

65 plt . figure ( figsize =(8 ,6) )


66 plt . plot ( fpr , tpr , label = ’ ROC curve ( area = %0.2 f ) ’ % roc_auc ,
color = ’ darkorange ’)
67 plt . plot ([0 , 1] , [0 , 1] , color = ’ navy ’ , linestyle = ’ -- ’)
68 plt . xlabel ( ’ False Positive Rate ’)
69 plt . ylabel ( ’ True Positive Rate ’)
70 plt . title ( ’ Receiver Operating Characteristic ’)
71 plt . legend ( loc = " lower right " )
72 plt . tight_layout ()
73 plt . savefig ( " roc_curve . png " , dpi =1200)
74 plt . show ()
75

76 plt . figure ( figsize =(8 ,6) )


77 sns . heatmap ( conf_mat , annot = True , fmt = " d " , cmap = " Blues " ,
78 xticklabels = sorted ( target . unique () ) ,
79 yticklabels = sorted ( target . unique () ) )
80 plt . title ( " Confusion Matrix \ nKNN - Breast Cancer Diagnosis " )
81 plt . xlabel ( " Predicted " )
82 plt . ylabel ( " Actual " )
83 plt . tight_layout ()
84 plt . savefig ( " confusion_matrix . png " , dpi =1200)
85 plt . show ()
Listing 1: Python Code for Breast Cancer Diagnosis using KNN

7 Conclusion and Future Work


The KNN classifier demonstrates considerable potential in accurately diagnosing breast
cancer. Its inherent simplicity and transparency make it an attractive model, while further
refinements can be pursued by:

• Hyperparameter tuning using cross-validation for optimal performance.

• Integration of more advanced classifiers (e.g., SVM, ensemble methods) for comparative
analysis.

• Implementation of feature selection techniques to identify the most significant predictors.

• Extensive ROC and AUC analysis to further validate the diagnostic capability.

In summary, this ultra pro legendary assignment not only sets a solid foundation for
breast cancer diagnosis using machine learning but also paves the way for innovative
future research and clinical advancements.

Appendix
Additional resources, extended code snippets, and comprehensive experimental logs are
provided for further reference.

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy