5 Breast Cancer Model - Ipynb Colab

Uploaded by

anshikarana09925
10/11/24, 6:34 PM breast-cancer-model.ipynb - Colab

Q1-Imports

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv('/kaggle/input/breast-cancer-dataset/breast-cancer.csv')
print(df.head())

id diagnosis radius_mean texture_mean perimeter_mean area_mean \


0 842302 M 17.99 10.38 122.80 1001.0
1 842517 M 20.57 17.77 132.90 1326.0
2 84300903 M 19.69 21.25 130.00 1203.0
3 84348301 M 11.42 20.38 77.58 386.1
4 84358402 M 20.29 14.34 135.10 1297.0

smoothness_mean compactness_mean concavity_mean concave points_mean \


0 0.11840 0.27760 0.3001 0.14710
1 0.08474 0.07864 0.0869 0.07017
2 0.10960 0.15990 0.1974 0.12790
3 0.14250 0.28390 0.2414 0.10520
4 0.10030 0.13280 0.1980 0.10430

... radius_worst texture_worst perimeter_worst area_worst \


0 ... 25.38 17.33 184.60 2019.0
1 ... 24.99 23.41 158.80 1956.0
2 ... 23.57 25.53 152.50 1709.0
3 ... 14.91 26.50 98.87 567.7
4 ... 22.54 16.67 152.20 1575.0

smoothness_worst compactness_worst concavity_worst concave points_worst \


0 0.1622 0.6656 0.7119 0.2654
1 0.1238 0.1866 0.2416 0.1860
2 0.1444 0.4245 0.4504 0.2430
3 0.2098 0.8663 0.6869 0.2575
4 0.1374 0.2050 0.4000 0.1625

symmetry_worst fractal_dimension_worst
0 0.4601 0.11890
1 0.2750 0.08902
2 0.3613 0.08758
3 0.6638 0.17300
4 0.2364 0.07678

[5 rows x 32 columns]

df.head()

         id  diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  smoothness_mean  compactness_mean  concavity_mean  concave points_mean  ...  radius_worst  texture_worst  ...
0    842302          M        17.99         10.38          122.80     1001.0          0.11840           0.27760          0.3001              0.14710  ...         25.38          17.33  ...
1    842517          M        20.57         17.77          132.90     1326.0          0.08474           0.07864          0.0869              0.07017  ...         24.99          23.41  ...
2  84300903          M        19.69         21.25          130.00     1203.0          0.10960           0.15990          0.1974              0.12790  ...         23.57          25.53  ...
3  84348301          M        11.42         20.38           77.58      386.1          0.14250           0.28390          0.2414              0.10520  ...         14.91          26.50  ...
4  84358402          M        20.29         14.34          135.10     1297.0          0.10030           0.13280          0.1980              0.10430  ...         22.54          16.67  ...

5 rows × 32 columns

Q2- Data Analysis

# 1. Understanding the structure of the dataset
print("Dataset Information:")
print(df.info()) # info() prints its own report and returns None, so print() also emits "None"

print("\nSummary Statistics:")
print(df.describe()) # Summary statistics of the numerical features

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
dtypes: float64(30), int64(1), object(1)
memory usage: 142.4+ KB
None

Summary Statistics:
id radius_mean texture_mean perimeter_mean area_mean \
count 5.690000e+02 569.000000 569.000000 569.000000 569.000000
mean 3.037183e+07 14.127292 19.289649 91.969033 654.889104
std 1.250206e+08 3.524049 4.301036 24.298981 351.914129
min 8.670000e+03 6.981000 9.710000 43.790000 143.500000
25% 8.692180e+05 11.700000 16.170000 75.170000 420.300000
50% 9.060240e+05 13.370000 18.840000 86.240000 551.100000
75% 8.813129e+06 15.780000 21.800000 104.100000 782.700000
max 9.113205e+08 28.110000 39.280000 188.500000 2501.000000

smoothness_mean compactness_mean concavity_mean concave points_mean \


count 569.000000 569.000000 569.000000 569.000000
mean 0.096360 0.104341 0.088799 0.048919
std 0.014064 0.052813 0.079720 0.038803
min 0.052630 0.019380 0.000000 0.000000

# 2. Checking for missing values


missing_values = df.isnull().sum()
print("\nMissing values in each column:")
print(missing_values)

Missing values in each column:


id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64

# 3. Analyzing the distribution of the target variable (assume 'diagnosis' as the target)
print("\nTarget variable distribution (Diagnosis):")
print(df['diagnosis'].value_counts())

Target variable distribution (Diagnosis):


diagnosis
B 357
M 212
Name: count, dtype: int64
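The counts above show a moderately imbalanced target. As a rough illustration (a sketch that hard-codes the two counts from the output rather than reading the CSV), the class shares and the accuracy of a trivial "always predict the majority class" baseline can be worked out as follows. The variable names are illustrative and do not come from the notebook.

```python
# Counts copied from the value_counts() output above.
counts = {"B": 357, "M": 212}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent class
# reaches this accuracy without learning anything.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model evaluated later is only informative to the extent that it beats this baseline, which is one reason precision and recall are reported alongside accuracy.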

# 4. Visualizing some important aspects of the dataset


import matplotlib.pyplot as plt
import seaborn as sns

# Select only numeric columns for the correlation matrix


numeric_df = df.select_dtypes(include=[np.number])

# Visualizing the correlation between numerical features


plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
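Each cell of the heatmap is a pairwise correlation coefficient; pandas' corr() uses the Pearson coefficient by default. A minimal hand computation, assuming only the five radius_mean and perimeter_mean values shown in the df.head() output (far too few rows for a real estimate, so this is purely illustrative), looks like this. The helper name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of two columns, copied from the df.head() output.
radius_mean = [17.99, 20.57, 19.69, 11.42, 20.29]
perimeter_mean = [122.80, 132.90, 130.00, 77.58, 135.10]

r = pearson_r(radius_mean, perimeter_mean)
print(f"Pearson r (5 rows only): {r:.3f}")
```

A value close to 1 for radius and perimeter is expected on geometric grounds, and strongly correlated feature pairs like this are what appear as the darkest off-diagonal cells in the heatmap.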


# Print the column names to verify


print(df.columns)

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',


'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')

# Generate the boxplot of radius_mean grouped by diagnosis
print(df.columns)
print(df[['diagnosis', 'radius_mean']].head())
sns.boxplot(x='diagnosis', y='radius_mean', data=df) # 'radius_mean' matches the column listing above

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',


'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
diagnosis radius_mean
0 M 17.99
1 M 20.57
2 M 19.69
3 M 11.42
4 M 20.29
<Axes: xlabel='diagnosis', ylabel='radius_mean'>

plt.figure(figsize=(6, 4))
sns.countplot(x='diagnosis', data=df) # One bar per diagnosis class
plt.title('Count of Diagnosis in Dataset')
plt.show()
print(df.columns)

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',


'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')

Q3-Data Preprocessing, Assign data and labels, Scaling Data, Splitting Data

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# 1. Assign Data (Features) and Labels (Target)
# 'diagnosis' is the target; every other column is kept as a feature.
# Note: this keeps the 'id' column, an identifier rather than a measurement,
# which is why the shapes printed below show 31 columns.
X = df.drop(columns=['diagnosis']) # Features (all columns except 'diagnosis')
y = df['diagnosis'] # Labels (target)

# 2. Encode the labels
# 'diagnosis' is categorical ('M' = malignant, 'B' = benign), so map it to numerical labels
y = y.map({'M': 1, 'B': 0})

# 3. Scaling the Data
# Standardize features by removing the mean and scaling to unit variance.
# Note: the scaler is fitted on the full dataset, before the split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 4. Splitting the Data
# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Display shapes of training and testing sets


print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (455, 31)
Test set size: (114, 31)
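The scaling step above calls fit_transform on the full feature matrix before splitting. A common alternative is to estimate the mean and standard deviation on the training rows only and reuse them for the test rows. The sketch below shows that idea with plain Python on the five radius_mean values from df.head(). The four-to-one split and the helper names are assumptions made for illustration and are not the notebook's method. Like StandardScaler, it uses the population standard deviation.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Estimate the scaling parameters from training data only."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously fitted parameters."""
    return [(v - mean) / std for v in values]

# First five radius_mean values from the df.head() output.
radius_mean = [17.99, 20.57, 19.69, 11.42, 20.29]

# Hypothetical split: first four rows for training, last row for testing.
train, test = radius_mean[:4], radius_mean[4:]

mean, std = fit_standardizer(train)
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)  # reuses the training mean and std

print("train z-scores:", [round(z, 3) for z in train_z])
print("test z-scores:", [round(z, 3) for z in test_z])
```

In scikit-learn terms this corresponds to calling fit on the training split and transform on both splits, so the test set never contributes to the scaling parameters.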

Q4-Model Implementation

# Importing the Naive Bayes classifier


from sklearn.naive_bayes import GaussianNB

# 1. Initialize the Naive Bayes classifier


model = GaussianNB()

# 2. Train the model using the training data


model.fit(X_train, y_train)

# 3. Make predictions on the test set


y_pred = model.predict(X_test)

# 4. Evaluate the model's performance


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# 5. Print the evaluation metrics


print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

Accuracy: 0.96
Precision: 0.98
Recall: 0.93
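For readers who want to see what a Gaussian Naive Bayes classifier does internally, here is a minimal single-feature sketch. It assumes hypothetical toy values loosely resembling radius_mean (they are not taken from the dataset) and illustrative function names. scikit-learn's GaussianNB handles many features and adds a small variance-smoothing term, both of which this sketch omits.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian_nb(values, labels):
    """Estimate a prior, mean and variance per class for one feature."""
    params = {}
    for cls in set(labels):
        xs = [x for x, y in zip(values, labels) if y == cls]
        params[cls] = {
            "prior": len(xs) / len(values),
            "mean": fmean(xs),
            "var": pvariance(xs),
        }
    return params

def log_posterior(x, p):
    """Unnormalized log posterior: log prior plus Gaussian log likelihood."""
    return (
        math.log(p["prior"])
        - 0.5 * math.log(2 * math.pi * p["var"])
        - (x - p["mean"]) ** 2 / (2 * p["var"])
    )

def predict(x, params):
    """Pick the class with the largest log posterior."""
    return max(params, key=lambda cls: log_posterior(x, params[cls]))

# Hypothetical toy data: 0 = benign, 1 = malignant.
radius = [11.4, 12.0, 12.9, 13.5, 17.9, 19.7, 20.3, 20.6]
label = [0, 0, 0, 0, 1, 1, 1, 1]

params = fit_gaussian_nb(radius, label)
print(predict(12.5, params), predict(19.0, params))
```

With several features, the per-feature log likelihoods are simply added, which is the "naive" conditional-independence assumption.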

Q5-Calculate the accuracy, precision, and recall for your data set

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay

# 1. Evaluate the model's performance


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# 2. Print the evaluation metrics


print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")

# 3. Calculate and display the confusion matrix


cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Benign (0)', 'Malignant (1)'])
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

Accuracy: 0.96
Precision: 0.98
Recall: 0.93
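The three metrics follow directly from the four cells of a confusion matrix. The notebook imports f1_score but never calls it, so the sketch below also includes F1. The counts are hypothetical, chosen only so that they sum to the 114 test rows and reproduce the rounded figures printed above. The actual confusion matrix appears only as a plot and is not reproduced in the text.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts (malignant = positive class), consistent with the
# rounded output above but not read from the notebook.
tp, fp, fn, tn = 40, 1, 3, 70

accuracy, precision, recall, f1 = classification_metrics(tp, fp, fn, tn)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1: {f1:.2f}")
```

In a screening setting, recall on the malignant class is usually the figure to watch, since each false negative is a missed malignant case.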

https://colab.research.google.com/drive/1A6NDSz1WmmrrWvLv5ija25kkoZoQlEN8#scrollTo=XIq3C7iloen6&printMode=true
