Aditya Predictive
PREDICTIVE-ANALYSIS
PROJECT-REPORT
Submitted By:
Name: Aditya Sah
Section: K21RM
Reg no: 12111525
Roll No: 27
Dataset Used:
The dataset consists of synthetic data generated to reflect real-world breast
cancer cell characteristics.
The main attributes include:
• cell_size: Describes the size of the cells, which can vary between benign and
malignant samples.
• cell_shape: Reflects the shape of the cells, an essential feature as malignant
cells tend to have irregular shapes.
• smoothness: Indicates the smoothness of cell borders, which can differ
significantly in cancerous cells.
• cell_density: Density metric for cell formation.
• symmetry: Symmetry attribute of the cell nuclei.
Since the primary task is predictive analysis, this dataset is suitable for a
binary classification model that predicts the diagnosis (benign or malignant)
from the cell characteristics listed above.
Data Preprocessing:
Preprocessing is crucial to ensure that models interpret the data uniformly,
reducing noise and optimizing for accurate predictions. Here are the
preprocessing steps applied to the dataset:
• Normalization: Cell attributes were normalized to rescale the data between
0 and 1. This puts every feature on the same scale, so that no attribute
dominates distance-based methods such as KNN merely because it has a larger
numeric range.
• Encoding: The target variable, diagnosis, was converted into a binary factor
with labels "Benign" and "Malignant." This step simplifies classification,
allowing algorithms to learn the difference between the two classes.
• Splitting: The dataset was split into training and testing subsets. This
step is essential for evaluating model performance: models are trained on
one subset and tested on another, unseen subset, which reveals overfitting
that training accuracy alone would hide.
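The three preprocessing steps above can be sketched in a few lines of plain
Python. This is only an illustration under stated assumptions: the function
names (min_max_normalize, encode_diagnosis, train_test_split) and the toy
values are hypothetical and are not taken from the project's code or dataset.

```python
import random


def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def encode_diagnosis(labels, positive="Malignant"):
    """Map text labels to 1 (positive class) or 0 (any other label)."""
    return [1 if label == positive else 0 for label in labels]


def train_test_split(rows, test_size, seed=0):
    """Shuffle row indices reproducibly and hold out `test_size` rows."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Hypothetical toy values, not rows from the report's dataset.
cell_size = [2.0, 4.0, 6.0, 10.0]
diagnosis = ["Benign", "Benign", "Malignant", "Malignant"]

scaled = min_max_normalize(cell_size)
encoded = encode_diagnosis(diagnosis)
rows = list(zip(scaled, encoded))
train, test = train_test_split(rows, test_size=1)
```

A fixed seed keeps the split reproducible, so repeated runs evaluate the
models on the same held-out rows.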
Algorithms Used:
1. K-Nearest Neighbors (KNN):
KNN is a simple yet effective classification algorithm. It classifies a sample based
on the class of its nearest neighbors. The distance between samples is calculated,
and the majority class among the k-nearest samples determines the classification.
For this project, we used a k value of 21, selected based on exploratory tests to
optimize accuracy.
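The voting rule described above can be written as a minimal sketch, assuming
Euclidean distance and unweighted majority voting (the report does not state
which distance metric was used). The function name knn_predict and the toy
points are hypothetical; the toy example uses k = 3 only because it has six
training points, whereas the project used k = 21.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label,
    # then sort so the closest points come first.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical normalized (cell_size, cell_shape) pairs.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
train_y = ["Benign", "Benign", "Benign",
           "Malignant", "Malignant", "Malignant"]

pred_low = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
pred_high = knn_predict(train_X, train_y, (0.9, 0.9), k=3)
```

With two classes, an odd k such as 21 avoids tied votes.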
KNN Workflow
1. Training: The model was trained on normalized data (449 samples).
2. Testing: 100 samples from the test set were used for evaluation.
3. Evaluation: A confusion matrix was generated to examine classification
accuracy, precision, and recall.
KNN's main advantage is simplicity, but it can be sensitive to the choice of k and
the distribution of the data.
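The evaluation step in the workflow above derives accuracy, precision, and
recall from the four cells of a confusion matrix. The sketch below assumes
"Malignant" is treated as the positive class; the function names and the
eight toy labels are hypothetical and do not reproduce the project's results.

```python
def confusion_counts(actual, predicted, positive="Malignant"):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # correctly flagged as positive
        elif p == positive:
            fp += 1  # flagged positive, actually negative
        elif a == positive:
            fn += 1  # missed positive case
        else:
            tn += 1  # correctly left as negative
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall, guarding empty denominators."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels for eight test samples.
actual = ["Malignant", "Malignant", "Malignant", "Benign",
          "Benign", "Benign", "Benign", "Benign"]
predicted = ["Malignant", "Malignant", "Benign", "Benign",
             "Benign", "Benign", "Malignant", "Benign"]

counts = confusion_counts(actual, predicted)
accuracy, precision, recall = classification_metrics(*counts)
```

In a diagnostic setting, recall on the malignant class is usually watched
most closely, because a false negative is a missed malignant sample.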
2. Linear Regression:
Linear regression is typically used for predictive modeling on continuous data. For
this project, we implemented a linear regression model to predict cell_shape
based on cell_size. This model explores the relationship between these features,
which may help identify trends in cell characteristics that correlate with
malignancy.
Linear Regression Workflow
1. Model Fit: A linear equation is fitted to the data, attempting to minimize the
difference between predicted and actual values of cell_shape.
2. Training and Testing: The training and test sets were used to observe how
well the model generalizes.
3. Visualization: The linear relationship was visualized, showing how
cell_shape varies with cell_size.
Linear regression performs well for simple, linearly related data. However, it
cannot capture non-linear patterns, which led us to explore polynomial
regression.
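For a single predictor, the fitted line has a closed-form least-squares
solution. The sketch below assumes ordinary least squares with cell_size as
the predictor and cell_shape as the response; the names fit_line and
mean_squared_error and the toy values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); both share the 1/n factor,
    # so it is omitted. Assumes x is not constant (sxx > 0).
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(x, y, intercept, slope):
    """Average squared difference between predicted and actual values."""
    return sum((intercept + slope * xi - yi) ** 2
               for xi, yi in zip(x, y)) / len(x)


# Hypothetical values that lie exactly on the line y = 1 + 2x.
cell_size = [1.0, 2.0, 3.0, 4.0]
cell_shape = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(cell_size, cell_shape)
mse = mean_squared_error(cell_size, cell_shape, intercept, slope)
```

Computing the same error on the test set, rather than the training set, is
what shows how well the line generalizes.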
3. Polynomial Regression:
Polynomial regression extends linear regression by incorporating higher-degree
terms. This approach allows the model to capture non-linear relationships. We
introduced cell_size^2 and cell_size^3 terms to improve the model fit for
non-linear variations in the data.
Polynomial Regression Workflow
1. Model Fit: Polynomial terms were added, producing a more flexible curve
that fits the data more accurately than a straight line.
2. Training and Testing: The polynomial model was tested to verify improved
performance over the linear model.
3. Visualization: We plotted the polynomial regression results, showing a more
nuanced curve that captures non-linearity between cell_size and
cell_shape.
Polynomial regression is beneficial when relationships are complex, as seen in
biological data like this dataset. However, higher-degree terms may lead to
overfitting, so careful tuning is essential.
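A cubic fit of this kind is still linear in its coefficients, so it can be
solved with the same least-squares machinery once the powers of cell_size are
added as extra columns. The sketch below assumes the normal equations solved
by Gaussian elimination; the function names and toy values are hypothetical,
and a statistics library would normally replace this hand-written solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular.
    """
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_polynomial(x, y, degree=3):
    """Least-squares coefficients [c0, c1, ..., c_degree] of a polynomial."""
    # Design matrix rows: [1, x, x^2, ..., x^degree].
    powers = [[xi ** p for p in range(degree + 1)] for xi in x]
    m = degree + 1
    # Normal equations: (X^T X) c = X^T y.
    A = [[sum(row[i] * row[j] for row in powers) for j in range(m)]
         for i in range(m)]
    b = [sum(row[i] * yi for row, yi in zip(powers, y)) for i in range(m)]
    return solve_linear_system(A, b)


def predict(coeffs, xi):
    """Evaluate the fitted polynomial at a single point."""
    return sum(c * xi ** p for p, c in enumerate(coeffs))


# Hypothetical values generated from y = 1 + 2x - 3x^2 + 4x^3.
true_coeffs = [1.0, 2.0, -3.0, 4.0]
cell_size = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
cell_shape = [predict(true_coeffs, xi) for xi in cell_size]

coeffs = fit_polynomial(cell_size, cell_shape, degree=3)
```

Raising the degree always lowers the training error, so comparing test-set
error between the linear and cubic fits is the safer check against
overfitting.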
4. K-Means Clustering:
K-Means clustering is an unsupervised learning algorithm that segments data into
clusters based on similarity. Here, it was used to group data points into clusters to
explore patterns in cell characteristics without relying on labeled data.
K-Means Workflow
1. Elbow Method: This method was used to identify an optimal number of
clusters (6 in this case).
2. Cluster Assignment: Each data point was assigned to one of the six clusters.
3. Visualization: Clusters were plotted to observe segmentation, which
provides insight into potential patterns in cell characteristics across the
dataset.
K-Means clustering is valuable for identifying inherent structure within data,
making it useful in exploring cell attribute patterns without labels.
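The elbow method and cluster assignment described above can be sketched with
Lloyd's algorithm, assuming Euclidean distance and initial centroids drawn at
random from the data. The function names and toy points are hypothetical. The
code only reports the within-cluster sum of squares (WCSS) for each k; reading
the elbow off those values (6 clusters in this project) remains a visual
judgment.

```python
import math
import random


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, wcss)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as seeds
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    wcss = sum(math.dist(p, centroids[l]) ** 2
               for p, l in zip(points, labels))
    return labels, centroids, wcss


# Hypothetical 2-D points forming three loose groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (0.0, 1.0), (0.1, 1.0), (0.0, 1.1)]

# Elbow method: record WCSS for a range of k and look for the bend.
wcss_by_k = {k: kmeans(points, k)[2] for k in range(1, 5)}
labels, centroids, _ = kmeans(points, 3)
```

Because the result depends on the random starting centroids, running the
algorithm with several seeds and keeping the lowest WCSS is a common
safeguard.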
Conclusion:
The predictive analysis techniques applied here lay a foundation for further
research on cell characteristic analysis, potentially aiding healthcare
professionals in early cancer detection. Future work could explore additional
features, larger datasets, or more advanced algorithms to further improve
predictive accuracy and reliability.