Project Report
on
Cancer Cell Detection using Machine Learning
Bachelor of Technology
in
Computer Science and Engineering
by
Akash Pandey (1809710010)
Ankit Kumar Singh (1809710015)
Nitin Kumar Yadav (1809710064)
CERTIFICATE
This is to certify that the project report entitled “CANCER CELL DETECTION USING
MACHINE LEARNING” submitted by Mr. Akash Pandey (1809710010), Mr. Ankit Kumar Singh (1809710015), and Mr. Nitin Kumar Yadav (1809710064) to the Galgotias College of Engineering & Technology, Greater Noida, Uttar Pradesh, affiliated to Dr. A.P.J.
Abdul Kalam Technical University Lucknow, Uttar Pradesh in partial fulfillment for the
award of Degree of Bachelor of Technology in Computer Science & Engineering is a
bonafide record of the project work carried out by them under my supervision during the year
2021-2022.
GALGOTIAS COLLEGE OF ENGINEERING & TECHNOLOGY
GREATER NOIDA, UTTAR PRADESH, INDIA - 201306.
ACKNOWLEDGEMENT
We have taken great efforts in this project. However, it would not have been possible without the kind support and help of many individuals and organizations. We would like to extend our sincere thanks to all of them.
We are highly indebted to Ms. Tanu Shree for her guidance and constant supervision. We are also highly thankful to her for providing the necessary information regarding the project and for her support in completing it.
We also express gratitude towards our parents for their kind co-operation and encouragement, which helped us in the completion of this project. Our thanks and appreciation also go to our friends who helped in developing the project, and to all the people who willingly helped us out with their abilities.
Akash Pandey
Ankit Kumar Singh
Nitin Kumar Yadav
ABSTRACT
KEYWORDS: Machine Learning, biomedical, KNN, prediction, application, dataset, SVM
CONTENTS
Title Page
CERTIFICATE i
ACKNOWLEDGEMENT ii
ABSTRACT iii
CONTENTS iv
LIST OF TABLES v
LIST OF FIGURES vi
NOMENCLATURE vii
ABBREVIATIONS viii
CHAPTER 1: INTRODUCTION
1.1 Motivation 1
CHAPTER 4: PROPOSED WORK
4.1 Introduction 27
4.2 Proposed methodology 30
4.3 Description of each step 30
CHAPTER 5: SYSTEM DESIGN
CHAPTER 6: IMPLEMENTATION
REFERENCES 66
LIST OF PUBLICATIONS 69
CONTRIBUTION OF PROJECT 70
LIST OF TABLES
LIST OF FIGURES
1.2 ML Architecture 6
1.3 ML Models 7
6.1 Dataset 47
NOMENCLATURE
English Symbols
A pre-exponential constant
D Density
ABBREVIATIONS
CHAPTER 1
INTRODUCTION
Cancer is a disease that occurs when some changes or mutations take place in genes
that help in cell growth. These mutations allow the cells to divide and multiply in a very
uncontrolled and chaotic manner. These cells keep increasing and start making replicas
which end up becoming more and more abnormal. These aberrant cells eventually grow
into a tumour. Tumors, unlike other cells, do not die when the body no longer
requires them. Cancer is a disease that arises in cells. Cancer can also occur in the
fatty tissue or the fibrous connective tissue within the body. These cancer cells
become uncontrollable and end up invading other healthy tissues and can travel to the
lymph nodes under the arms. Tumors are of two types: malignant and benign.
1.1 Motivation
Malignant tumors are cancerous. These cells keep dividing uncontrollably and start
affecting other cells and tissues in the body. They spread to all other parts of the body
and it is hard to cure this type of cancer. Chemotherapy, radiation therapy, and
immunotherapy are types of treatments that can be given for these types of tumors.
Benign tumors are non-cancerous. Unlike malignant tumors, they do not spread to other parts of the body and hence are much less risky. In many cases, such tumors don't require any treatment. Cancer is most commonly diagnosed in people above the age of 40, but the disease can affect men and women of any age. It can also occur
when there’s a family history of cancer. Cancer has always had a high mortality rate; according to statistics, breast cancer alone accounts for about 25% of all new cancer diagnoses and 15% of all cancer deaths among women worldwide. Scientists have known about the
dangers of it from very early on, and hence there’s been a lot of research put into
finding the right treatment for it.
Cancer is a disease that we hear about a lot nowadays. It is one of the most widespread diseases. There are around 2,000+ new cases of cancer in men each year and about 230,000 new cases in women every year. A correct and early diagnosis is therefore critical.
The main objective of this project is to help doctors analyze the huge datasets of cancer data and find patterns in the patients' data. With this analysis, we can predict whether a patient might have cancer or not.
This dataset analysis will be aided by machine learning algorithms. These techniques will be used to predict the outcome, which can be either benign or malignant. Benign cancer does not spread, whereas malignant cancer cells spread across the body, making them very dangerous. This prediction can help doctors prescribe different medical examinations for the patients based on the cancer type. The patient will save a lot of time and money as a result of this.
Supervised learning – Here both the input and output are known. The training dataset
also contains the answer the algorithm should come up with on its own. So, a labeled
data set of fruit images would tell the model which photos were of apples, bananas, and
oranges. When a new image is given to the model, it compares it to the training set to
predict the correct outcome. Supervised learning algorithms build mathematical models from sets of data that contain both the inputs and the desired outputs. This data is called training data and consists of sets of training examples; through repeated optimization of an objective function, supervised learning algorithms learn the function that is used to predict the result associated with new inputs.
Unsupervised learning – Here the input dataset is known but the output is not. A deep learning model is given a dataset without any instructions on what to do with it. The training data contains information without any correct result. The network tries to automatically understand the structure of the data.
Reinforcement learning – In this type, AI agents are trying to find the best way to
accomplish a particular goal. It tries to predict the next step which could give the model
the best result at the end.
Classification method: When the output variable is categorical, i.e. takes one of a discrete set of classes such as Yes-No, Male-Female, True-False, etc., classification methods are utilized. Examples of applications and algorithms include:
Spam Filtering
Random Forest
Decision Trees
Logistic Regression
SVM
This learning function, as previously said, classifies the data item into one of several predetermined classes. Training and generalization mistakes can occur when ML techniques are used to create a classification model. The former refers to training data misclassification errors, while the latter pertains to expected errors on testing data. A basic requirement for any ML technique is a sufficiently large dataset that can be partitioned into disjoint training and test sets, or subjected to some sensible form of n-fold cross-validation for smaller datasets.
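As a concrete illustration of this requirement, here is a minimal sketch, assuming Python with scikit-learn; its bundled Wisconsin breast cancer dataset stands in for the project's data:

# Sketch: disjoint train/test partition plus n-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Disjoint, stratified train/test partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))

# For smaller datasets: n-fold cross-validation instead of a single split.
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print("10-fold CV mean accuracy:", scores.mean())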
Fig 1.1 Classification Problems [1]
Regression method: When the output variable is a continuous value, regression methods are utilized, such as:
Linear Regression
Regression Trees
Non-Linear Regression
With the help of supervised learning, the model can predict the output based on prior experience. However, if the test data differs from the training dataset, supervised learning cannot predict the proper output.
In the last experiment, we discovered how to create a machine learning supervised classifier automatically. We used the GP algorithm to address the hyperparameter problem, which is a difficult problem for machine learning algorithms. From among the many configurations, the proposed algorithm chose the most appropriate one. Machine learning is used in many applications, and machine learning algorithms are being actively implemented by companies to identify the level of access individuals require in various places based on their job profiles.
Advantages of ML:
The fundamental goal of machine learning approaches is to create a model that can be
used for classification, prediction, estimation, and other tasks. The most common task
in the learning process is classification. This learning function, as previously said,
classifies the data item into one of several predetermined classes. Training and
generalization mistakes can occur when ML techniques are used to create a
classification model. The former refers to training data misclassification mistakes,
while the latter pertains to predicted testing data errors. A good classification model
should be able to accurately categorize all of the examples in the training set.
Over the last decade, interest in machine learning (ML) for healthcare has exploded.
Though machine learning has been an academic discipline since the mid-twentieth
century, advances in computing capabilities, data availability, creative approaches, and
a growing pool of technical talent have hastened its application in healthcare. The
academic and general press have focused much of their emphasis on applications of
machine learning in healthcare delivery; however, uses of machine learning in clinical
research are less frequently highlighted (Fig. 1). Clinical research is a broad subject,
with preclinical research and observational studies leading to traditional trials and
pragmatic trials, which in turn encourage clinical registries and more implementation
research. Clinical research as it is now conducted is complicated, despite its importance
in improving healthcare and outcomes. It is time-consuming, costly, and prone to
unanticipated errors and biases, which might jeopardize its successful application,
implementation, and adoption.
The application of machine learning to clinical trials may alter the data collecting,
management, and analysis strategies necessary. ML approaches, on the other hand, can
assist in overcoming some of the challenges connected with missing data and obtaining
real-world data. If the ML model is subverted by purposefully or inadvertently
manipulated sensor data when processing wearable sensor output to extract research
endpoints, the results may be tainted. Other device-related prospects in patient
centricity, aside from the development of novel digital biomarkers, include the capacity
to export data and analytics back to participants to facilitate education and insight.
Better defining how previously validated clinical objectives and patient-centric digital biomarkers overlap, as well as understanding participant attitudes on privacy in relation to the sharing and use of device data, remain barriers to the adoption of ML processing of device data.
Data collection
The development of integrative prediction models by merging the output of many algorithms is a breakthrough, but it also poses a difficulty in terms of interpretation and dependability of the models' clinical implications.
The primary goal of machine learning approaches is to create a model that may be used
for classification, prediction, estimation, or any other job. Classification is the most
prevalent activity in the learning process. This learning function, as previously said,
classifies the data item into one of several predetermined classes. Training and
generalisation mistakes can occur when ML techniques are used to create a
classification model. The former refers to training data misclassification mistakes,
while the latter pertains to predicted testing data errors. A good classification model
should closely match the training set and correctly categorise all cases. It's critical to
measure the classifier's performance after obtaining a classification model using one or
more machine learning approaches. Each proposed model's performance is evaluated in
terms of sensitivity, specificity, accuracy, and area under the curve (AUC). The
proportion of true positives successfully identified by the classifier is known as
sensitivity, whereas the proportion of true negatives correctly identified is known as
specificity.
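A minimal sketch of how these four measures can be computed from a binary confusion matrix, assuming scikit-learn, with labels 0 = benign and 1 = malignant:

# Sketch: sensitivity, specificity, accuracy and AUC from a confusion matrix.
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    # Unpack the 2x2 confusion matrix: true/false negatives and positives.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)               # proportion of true positives found
    specificity = tn / (tn + fp)               # proportion of true negatives found
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    auc = roc_auc_score(y_true, y_score)       # area under the ROC curve
    return sensitivity, specificity, accuracy, auc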
SVMs are a more contemporary approach to machine learning methods used in cancer prediction and prognosis. SVMs first transfer the input vector into a higher-dimensional feature space and find the hyperplane that divides the data points into two classes. The marginal distance between the decision hyperplane and the instances closest to the boundary is maximised. The generated classifier has a high degree of generalizability and can thus be used to reliably classify fresh samples. It's worth mentioning that SVMs can also produce probabilistic outcomes. If more features were included, a hyperplane would characterise the separation. The hyperplane is determined by the support vectors, a subset of the two classes' points. Formally, the SVM algorithm produces a hyperplane that divides the data into two classes with the maximum margin, meaning that the distance between the hyperplane and the nearest examples (the margin) is maximised. SVMs can perform non-linear classification tasks by using a non-linear kernel. A non-linear kernel is a mathematical function that transforms data from a linear feature space to a non-linear feature space. The performance of an SVM classifier can be dramatically improved by applying different kernels to different datasets.
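A sketch of a maximum-margin SVM that also emits the probabilistic outcomes mentioned above, assuming scikit-learn (probability=True fits an internal calibration step, which slows training; the dataset is a stand-in):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Kernelised maximum-margin classifier with probability estimates enabled.
svm = SVC(kernel="rbf", C=1.0, probability=True)
svm.fit(X_train, y_train)
print(svm.predict_proba(X_test)[:5])   # per-class probability estimates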
Data Analysis
Clinical trial data, registries, and clinical practice data are rich sources for hypothesis
creation, risk modeling, and counterfactual simulation, and machine learning is well
suited for these tasks. Unsupervised learning, for example, can discover phenotypic
groupings in real-world data that can be investigated further in clinical trials.
Furthermore, ML has the potential to improve the widely used technique of secondary
trial analysis by more effectively identifying treatment heterogeneity while still offering
some (insufficient) protection against false-positive findings, revealing more viable
areas for future research. Furthermore, machine learning can be utilised to develop risk
predictions in retrospective datasets that can then be confirmed prospectively.
For example, using a random forest model on companion trial data, researchers were able to better discriminate between patients who would do better or worse following cardiac resynchronization therapy, compared to a multivariable logistic regression. This highlights random
forests' capacity to predict interactions between features that are missed by simpler
models. In conclusion, there are numerous effective machine learning approaches to
clinical trial data administration, processing, and analysis, but there are fewer techniques
for enhancing data quality as it is generated and collected. Because data availability and
quality are the foundations of machine learning methodologies, conducting high-quality
trials is critical for higher-level ML processing.
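Not the cited study's code, but a sketch of the comparison it describes: measuring how well a random forest discriminates relative to a multivariable logistic regression, using AUC on held-out data (scikit-learn assumed; the breast cancer dataset stands in for the trial data):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the trial dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
lr = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("Random forest AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
print("Logistic regr. AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))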
Several frameworks and reproducible research practices have been established. Transparency is addressed in terms of shared code, software dependencies, and the parameters required to train a model, allowing the research study to be conducted in a
more transparent manner. Building an ML-based framework for cancer prediction and
classification while dealing with several data modalities, i.e. multimodal frameworks,
provides a new challenge in the field of cancer research. The development of
integrative prediction models by merging the output of many algorithms is a
breakthrough, but it also poses a difficulty in terms of interpretation and dependability
of the models' clinical implications. The lack of consideration devoted to data size and learner validation was one of the most commonly recognised flaws identified among the studies reviewed here. As it stands, a variety of studies use a disorganized testing strategy. A sufficiently large dataset that can be partitioned into disjoint training and test sets, or subjected to some sensible form of n-fold cross-validation for smaller datasets, is a basic requirement for any ML technique. The amount of data isn't the only constraint to effective machine
learning. Quality data sets and careful feature selection are also critical. Data entry and
data verification are crucial when working with huge data sets. Careless data entry can
frequently result in simple off-by-one errors, in which all of the values for a specific
variable in a table are pushed up or down by one row. This is why having a second data
input curator or data checker perform independent verification is usually advantageous.
Additional data integrity verification or spot checks by a competent specialist, not
merely a data entry clerk, is also a beneficial exercise.
CHAPTER 2
LITERATURE REVIEW
The motivation behind this research is the rapid growth in cancer incidence and
mortality cases worldwide. The reasons are complicated, but they include population
ageing and expansion, as well as changes in the prevalence and distribution of cancer's
key risk factors. The paper depicts the cancer incidence and death statistics reported by the American Cancer Society and other reliable resources.
It talks about the current methodology used in the medical sector for cancer prediction.
Screening
Chemotherapy
AI-based techniques have contributed significantly to the field of cancer research. The research works mentioned in the literature have focused mainly on deep learning
techniques. Deep learning classifiers have dominated over machine learning models in
the field of cancer research. Among Deep learning models, Convolutional Neural
Networks (CNN) have been used most commonly for cancer prediction; approximately
41% of studies have used CNN to classify cancer. Neural networks (NN) and Deep
Neural Networks (DNN) have also been used extensively in the literature. Ensemble
learning techniques (Random Forest Classifier, weighted voting, Gradient Boosting Machines) and Support Vector Machines (SVM) are the most commonly utilised in
literature, aside from deep learning approaches.
This graph depicts the distribution of literature based on AI-based prediction models.
Investigation 2: Which cancer site and training data have been explored most
extensively? The majority of the research papers examined in this review focused on
automated cancer prediction and diagnosis. The most extensively explored site is the breast, followed by the kidney. Most researchers have worked on brain, colorectal,
cervical, and prostate cancer prediction in addition to breast and kidney cancer. It
shows how research works are distributed throughout cancer sites.
The type of data utilised to train the prediction model has a substantial impact on its
performance. The data used to train the classification model has an impact
on the model's dependability and prediction outcomes. Most of the research studies
reviewed in this paper have used Magnetic Resonance Imaging (MRI). The second
most commonly used data is Computed Tomography (CT) scan images.
Other picture types utilised in the literature include dermoscopic, mammographic,
endoscopic, and pathological. It highlights the distribution of papers based on the type
of data used to train the prediction model.
Investigation 3: In which years were most of the cancer prediction studies published? The research works published between 2009 and April 2021 are selected in this review article. It demonstrates the distribution of the articles based on the published year. The majority of the research papers were published in 2020 (35), 2019 (32), and 2018 (30).
We could only retrieve papers published up until April 2021, hence there are few
papers from 2021.
Convolutional Neural Networks models have been used to predict practically every
type of cancer, including brain, colorectal, skin, thyroid, and lungs, due to their
specificity. The majority of studies that looked into breast cancer diagnostic prediction
used hybrid models or unique methodologies. In addition, neural networks have been applied to practically all datasets related to breast and cervical cancer. Only Convolutional
Neural Networks have been used to study stomach cancer. Support Vector Machines
(SVMs) have been used to forecast liver and breast cancer.
In a nutshell, Convolutional Neural Networks can be applied with different datasets [1].
Conclusion: This review study attempts to summarize the various research directions
for AI-based cancer prediction models. AI has marked its significance in the area of
healthcare, especially cancer prediction. The paper presents a complete evaluation of
the machine and deep learning models employed in cancer early detection employing
medical imaging, as well as a critical and analytical examination of current state
-of-the-art cancer diagnostic and detection analysis methodologies. Machine and deep
learning techniques for extracting and categorising disease features play a vital role in
early cancer prognosis and diagnosis using AI techniques. Most earlier literature works,
according to our findings, used deep learning approaches, particularly Convolutional
Neural Networks.
Future Scope: As AI & ML are very advanced technologies, there is huge scope in this area. We have to develop better algorithms to train the model so that it can learn fast and efficiently.
Using a machine learning system, this work seeks to solve the challenge of automatic
breast cancer detection. The current method is divided into steps. The breast cancer
dataset was used in three distinct projects. In the first test, we demonstrated that with
effective configuration, the three most popular evolutionary algorithms may attain the
same performance. The second experiment looked at how combining different features
selecting approaches enhances accuracy. Finally, in the last experiment, we discovered
how to create a machine learning supervised classifier automatically. We used the GP
algorithm to address the hyperparameter problem, which is a difficult problem for
machine learning algorithms. From among the many configurations, the proposed
algorithm chose the most appropriate algorithm. The Python library was used in all of
the experiments. Despite the fact that the proposed method produced significant results
by analysing an ensemble of approaches using an exhaustive machine learning strategy,
we discovered a significantly higher time consumption rate.
Feature extraction is a key stage in the identification of breast cancer since it aids in the
differentiation of benign and malignant tumours. Image attributes such as smoothness,
coarseness, depth, and regularity are extracted via segmentation. The
goal of this project is to improve the list of data transformations and machine learning
algorithms that will be used to complete the classification transformation. It's
challenging to find the ideal machine learning algorithm and data combination. Genetic programming (GP) [22] is presented to optimise the data and control parameters of the proposed model, given the growing burden of hyperparameter tuning. This well-known evolutionary technique is used to select the best combination that leads to the best evaluation findings. The GP produces a predetermined number of pipelines at
random to make up the population members. Each individual (pipeline) in the
population was assessed based on their fitness, which was used as the categorization
score in this study. The pipelines are built using
supervised models from the scikit-learn package. For all classifiers besides linear discriminant analysis, the hyperparameters optimised in this work include the number of kernel functions. The number of kernels is determined at random. The amount of data isn't the only constraint to effective machine learning. Quality data sets and careful feature selection are also critical. Data entry and data verification are crucial when working with huge data sets. Careless data entry can frequently result in simple off-by-one errors, in which all of the values for a specific variable in a table are pushed up or down by one row. This is why having a second data input curator or data checker perform independent verification is usually advantageous.
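The GP-evolved scikit-learn pipelines described above are what the TPOT library implements; a sketch, assuming TPOT is installed (pip install tpot), with hypothetical generation and population settings:

from tpot import TPOTClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each GP individual is a scikit-learn pipeline; its cross-validation
# score serves as the fitness, as described above.
search = TPOTClassifier(generations=5, population_size=20, cv=5,
                        random_state=42, verbosity=2)
search.fit(X_train, y_train)
print(search.score(X_test, y_test))
search.export("best_pipeline.py")   # write the winning pipeline out as code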
This helps us save a lot of time as well as money for the patient. Challenges faced by the researchers in the construction of AI-based prediction models, from the above literature survey: limited data size. The most common challenge faced by most of the studies was insufficient data to train the model. A small sample size implies a smaller training set, which does not authenticate the efficiency of the proposed approaches. The paper presents a complete evaluation of the machine and deep learning models employed in early cancer detection using medical imaging, as well as a critical and analytical examination of current state-of-the-art cancer diagnostic and detection analysis methodologies [2].
Conclusion: In the above paper, the authors apply different algorithms to the datasets to check their accuracy. The results show that the gradient algorithms are the ones with higher accuracy than the others. The reason behind this is that the gradient methods use a combination of all the other algorithms and select the best one.
Future Scope: The above results show that for better accuracy we have to use gradient algorithms, so cross-validation of the results becomes very important to obtain an optimized result.
Cruz JA et al. [3] assessed the predictive power of models. Each model's predictive power was
verified in at least three methods. To begin, the models' training was evaluated and
monitored using 20-fold cross-validation. To reduce the stochastic element involved
with sample partitioning, a bootstrap resampling method was used, which involved
performing cross-validation 5 times and averaging the results. Second, the feature
selection procedure was repeated 100 times within each fold to reduce bias in feature
selection (i.e. selecting the most informative subset of SNPs) (5 times for each of the 20
folds). The results were then compared to a random permutation test, which had a
prediction accuracy of 50% at best. While the authors sought to reduce the random
element of sample partitioning, a preferable strategy would have been to employ leave-
one-out cross-validation, which would have totally eradicated this stochastic factor.
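A sketch of both validation schemes with scikit-learn: 20-fold cross-validation repeated 5 times with the results averaged, and leave-one-out, which removes the stochastic element of partitioning entirely (the classifier and dataset here are hypothetical stand-ins, not the cited study's):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, RepeatedKFold, LeaveOneOut

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)      # stand-in classifier

# 20-fold CV repeated 5 times, averaging away the partitioning noise.
cv_rep = RepeatedKFold(n_splits=20, n_repeats=5, random_state=0)
print(cross_val_score(clf, X, y, cv=cv_rep).mean())

# Leave-one-out: one fold per sample, fully deterministic.
print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())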
Greater attention to experimental design and implementation appears to be necessary, particularly in terms of the quantity and quality of biological data [3].
Conclusion: The above paper talks about the risk of cancer and how important it is to detect it at an early stage. The paper shows results on the survival rate when cancer is detected at an earlier stage, and how fatal it can be when detected late. It also focuses on upcoming developments in the field of medical sciences and the collection of data.
The BraTS dataset is a collection of brain tumour MRI scans obtained from various
different locations under standard clinical conditions, but using different equipment and
imaging techniques, resulting in significantly different image quality reflecting diverse
clinical practice across institutions. The following tumour annotation technique, on the
other hand, was created to allow for equivalent ground truth delineations across
different annotators. However, there are non-necrotic, non-cystic patches in high-grade
tumours that do not enhance but may be distinguished from the surrounding vasogenic
oedema and constitute nonenhancing infiltrative tumour. Another issue is the tumour centre definition in low-grade gliomas. It's tough to tell the difference between tumour and vasogenic oedema in these circumstances, especially if there's no enhancement.
While individual automated segmentation systems have increased in accuracy, their
robustness remains inferior to expert performance, as measured by inter-rater
agreement. This robustness is predicted to improve as the training set grows in size, as
more diverse patient populations are captured and described, as well as improved
training schemes and ML architectures. Beyond these speculative predictions, our
quantitative analyses reveal that the fusion of segmentation labels from diverse
automated approaches is more resilient than the ground truth inter-rater agreement
(given by clinical professionals) in terms of both accuracy and consistency across
participants. However, the proposed ways to ensemble many models are a viable way to
eliminate outliers and enhance automated segmentation precision. Future research is
critical, according to us, in order to improve the robustness of individual techniques by
improving the ability of segmentation algorithms to manage confounding effects that
are common in images taken through ordinary clinical workflows [4].
Conclusion: The paper examines the BraTS data with regard to feature extraction from the MRI scans; it covers segmentation, testing, and survival prediction. Basically, it checks every aspect of the data to extract information from it. Every year BraTS collects data through different processes, and in this paper the authors analyse the data to extract information from it.
Future Scope: The collection of data is going to be very important, because only if data is available can we garner more information from it. Storing this information will be very beneficial and can reduce the work of re-deriving it every time.
Magnetic resonance imaging is one of the most frequent approaches for detecting brain
tumours (MRI). It provides crucial data for examining the human body's interior
structure in depth. Because of the variety and complexity of brain tumours, MR image
categorization is a difficult endeavour. Sigma filtering, adaptive threshold, and
detection region are among the steps in the suggested technique for detecting a brain
tumour in MR images. Major Axis Length, Euler Number, Minor Axis Length,
Solidity, Area, and Circularity are some of the shape features that are taken into
account when extracting features from MR images. The suggested method employs two supervised classifiers: the first is the C4.5 decision tree algorithm, and the second is the Multi-Layer Perceptron (MLP)
algorithm. The classifiers are used to categorize brain cases as normal or abnormal;
abnormal brain cases are divided into one type of benign tumour and five types of
malignant tumours. Using the MLP algorithm and 174 samples of brain MR images,
maximum precision of roughly 95% is attained [5].
Conclusion: From the above paper we can conclude that the MLP algorithm is better than the C4.5 decision tree algorithm, as MLP gives better accuracy. The C4.5 decision tree algorithm is faster but not as precise.
Future Scope: As we already know that MLP is more accurate than the C4.5 decision tree, we should always use MLP or an algorithm better than MLP.
The authors used the Naive Bayes Classifier, Support Vector Machine (SVM) Classifier,
Bi-clustering and Ada boost Techniques, R-CNN (Convolutional Neural Networks)
Classifier, and Bidirectional Recurrent Neural Networks (HA-BiRNN). This section
explains these techniques. RFE and SVM are combined in the SVM Classifier method
[6]. RFE is a recursive strategy for selecting dataset features based on the least feature
value. As a result, in all rounds of SVM-RFE, the incorrect features (lowest weight
feature) are removed. The pre-processed photos are used to extract entropy,
geometrical, and textural properties. Entropy (E) is a randomness measure that is
used to describe the texture of an input image [11]. Shape characteristics serve a crucial
role in distinguishing between normal and cancerous cells. Each picture is divided into
'n' sub squares and quantified in textual features.
The authors present the new DNNS breast cancer detection algorithm. Unlike other
methods, the proposed solution is based on a deep neural network's Support value. A
normalising procedure has been used to improve the performance, efficiency, and
quality of photographs. Experiments have shown that the suggested DNNS is far
superior to existing approaches. The suggested method is guaranteed to be favourable
in terms of performance, efficiency, and image quality, all of which are critical in
today's medical systems [6].
The suggested system describes the breast cancer model and depicts the Support Vector
Machine (SVM) algorithm's execution in classifying breast cancer tumours as benign or
malignant. The dimensionality reduction technique is implemented in the preparation of
data module. Dimensionality reduction is the process of reducing the number of
independent variables to a small number of key variables by eliminating those that
aren't as important in forecasting the outcome. The data is analysed, and a model is
created to predict whether or not the tumour is malignant. It's a binary classification
problem, and the data is checked for accuracy using a few techniques.
The dataset retrieved from the UCI machine learning repository is used to train and test
our proposed system. Radius, texture, perimeter, area, smoothness, compactness,
concavity, concave, concave points, and fractal dimension are among the 10 real-valued
features in the dataset. The cell nucleus of the breast mammography is used to calculate
these properties. The correlation of two variables is the statistical link between them.
With precise and dependable prediction effects, an appropriate breast cancer diagnosis
model can assist scientific practitioners.
With 99 percent accuracy, the proposed approach diagnoses tumours as
malignant or benign using attributes extracted from cell pictures. As a result, it can be
utilised as an effective tool for detecting and preventing breast cancer. In this arena, the
integration of multidimensional data with various categorization, feature selection, and
dimensionality reduction approaches can give excellent inference tools. More study
may be done in this area to improve the classification techniques' ability to predict
using different factors.
From this survey we conclude that most of the automatic cancer prediction systems are based on machine learning concepts, including classification and clustering algorithms. This paper presented an extensive review of various ML classification techniques for the prediction of cancer, and standard datasets have been used for a wide variety of cancers such as brain cancer and breast cancer. A detailed list of results found by many researchers has been tabulated, covering problems solved by various computational intelligence techniques. The most successful approach is SVM and combinations of SVM techniques, which gave up to 99% accuracy on a smaller number of training datasets, though this does not guarantee good prediction in the case of large datasets. However, options are available for improving the prediction of cancer at an early stage. There are many datasets available to explore further for the same. There are a large number of cancer types whose behaviour is still unknown [7].
CHAPTER 3
PROBLEM FORMULATION
Over the years, a continuous evolution related to cancer research has been performed. Scientists used various methods, such as early-stage screening, so that they could find different types of cancer before it could do any damage. With this research, they were
able to develop new strategies to help predict early cancer treatment outcomes. With
the arrival of new technology in the medical field, a huge amount of data related to
cancer has been collected and is available for medical research. But, physicians find the
accurate prediction of the cancer outcome as the most interesting yet challenging part.
For this reason, machine learning techniques have become popular among researchers.
These tools can help discover and identify patterns and relationships in the cancer data from huge datasets, while they can effectively predict future outcomes of
cancer type. Patients have to spend a lot of money on different tests and treatments to
check whether they have breast cancer or not. These tests can take a long time and the
results can be delayed. Also, after confirmation that the patient has cancer, more tests
need to be done to check whether the cancer is benign or malignant. In this project, we will be using different machine learning techniques to analyze the data given in the
datasets. This analysis will help us predict whether the cancer is benign or malignant.
Benign cancer is cancer that doesn’t spread whereas malignant cancer cells spread
across the body making it very dangerous. This dataset analysis will be aided by
machine learning algorithms. These techniques will be used to predict the outcome. The
outcome can be either that the cancer is benign or malignant. Benign cancer is cancer
that doesn’t spread, whereas malignant cancer cells spread across the body, making it very dangerous.
This prediction can help doctors prescribe different medical examinations for the
patients based on the cancer type. This helps us save a lot of time as well as money for
the patient. Challenges faced by the researchers in the construction of AI-based
prediction models from the above literature survey: Limited data size. The most common challenge faced by most of the studies was insufficient data to train the model.
A small sample size implies a smaller training set which does not authenticate the
efficiency of the proposed approaches. A good sample size can train the model better
than a limited one. Multiple dimensions: high dimensionality is another data-related challenge in cancer research. High dimensionality refers to a large number of features compared to the number of cases.
Problem of class disparity: the uneven distribution of classes in medical data sets, particularly cancer data, is a major concern. Class imbalance arises due to a mismatch in the sample size of each class. Computational time: about 90% of studies have endorsed deep learning approaches over other techniques to predict cancer using medical images. Deep learning-based techniques, on the other hand, are quite complicated. About 41% of the studies have used the CNN classifier, which has performed significantly well but at the cost of high computational time and space. Clinical application: although AI-based models have proven their dominance in cancer research, their practical deployment in clinics has yet to be implemented. These models need to be validated in a clinical setting to assist the medical practitioner in affirming the diagnosis verdicts. Model generalizability: a shift in research towards
improving the generalizability of the model is required. The majority of studies
have advocated a single-site validation model for prediction. The models must be
validated on numerous sites in order to increase the model's generalizability.
How do we increase the accuracy of cancer cell detection using different algorithms?
Limited data size: the most common challenge faced by most of the studies was insufficient data to train the model.
Efficient feature selection technique: many studies have achieved exceptional prediction outcomes. However, a computationally effective feature selection method is still needed to reduce the data cleaning procedures while generating high cancer prediction accuracy.
3.4 Objectives
Improving cancer detection and localization: Models can classify as well as locate the
position of cancer cells with high accuracy.
Improving speed for real-time detection: building a model which can perform complex computations, like neural network algorithms and position detection, within a limited interval with good speed and accuracy.
CHAPTER 4
PROPOSED WORK
4.1 Introduction
In the proposed system we plan on using existing data of cancer patients which has
been collected for several years and run different machine learning algorithms on them.
These algorithms will analyze the data from the datasets to predict whether the patient
has cancer or not and it will also tell us if the cancer is malignant or benign.
It is done by taking the patient’s data and mapping it with the data set and checking
whether there are any patterns found with the data. If a patient has breast cancer, then
instead of taking more tests to check whether the cancer is malignant or benign, ML
can be used to predict the case based on the huge amount of data on breast cancer. This
proposed system helps the patients as it reduces the amount of money they need to
spend just for the diagnosis. Also, if the tumor is benign, then it is not cancerous, and
the patient doesn’t need to go through any of the other tests. This also saves a lot of
time.
The description above covers our model, our research paper, and the processing techniques. Now, the most important part of any software or application is its user interface: providing an interactive and friendly environment to the user makes daily tasks and working with the application all day easier. The fundamental goal
of machine learning approaches is to create a model that can be used for classification,
prediction, estimation, and other tasks. The most common task in the learning process
is classification. This learning function, as previously said, classifies the data item
into one of several predetermined classes. Training and generalization mistakes can
occur when ML techniques are used to create a classification model. The former refers
to training data misclassification mistakes, while the latter pertains to predicted
testing data errors. A good classification model should be able to accurately
categorize all of the examples in the training set. If the test error rates of a model begin to increase even though the training error rates decrease, then the phenomenon of model overfitting occurs. This scenario is related to model complexity: as the model complexity increases, the training errors of the model will decrease. The optimum complexity of a model is the one with the least generalization error, which is not prone to overfitting. The bias-variance decomposition is a formal method for
assessing a learning algorithm's expected generalization error.
The bias component of a particular learning algorithm measures the error rate of that
algorithm. Additionally, a second source of error over all possible training sets of a
given size and all possible test sets is called the variance of the learning method.
Data-sets: This part of the project includes the research about the available dataset that
can be used for the project. The initial dataset on which we planned to work had certain
flaws. CBIS-DDSM dataset: Neither the analytics nor the experimental validation of
the data was sufficient to move forward with the project. There was no exact data
regarding the classification of tumors into benign and malignant in the case of scan
images of patients. One of the most important demerits of this dataset was the improper
organization of data and huge redundant records which can cause glitches in the
training process.
Pre-processing data: Data preprocessing is a data mining technique that is used to convert raw data into a usable format, because real-world datasets arrive in many different formats. Since the data isn't available in the format we require, it must be transformed into an intelligible form.
Encoder Method: Label Encoder is an efficient tool for encoding the levels of categorical features into numeric values. Label Encoder encodes labels with values between 0 and n_classes - 1. All our categorical features are encoded. In this work, we have encoded the diagnosis column, representing benign as 0 and malignant as 1. After encoding the dataset, we applied a neural network to it and achieved good accuracy.
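A sketch of this encoding step, assuming pandas and scikit-learn (the CSV file name and column name are hypothetical):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data.csv")                    # hypothetical file name
encoder = LabelEncoder()
# Classes are sorted alphabetically, so 'B' (benign) -> 0, 'M' (malignant) -> 1.
df["diagnosis"] = encoder.fit_transform(df["diagnosis"])
print(encoder.classes_)                         # ['B' 'M']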
radius
texture
perimeter
area
smoothness
compactness
concavity
concave points
symmetry
fractal dimension
Table 4.1 Measuring Factors Table
4.2 Proposed Methodology
4.3 Algorithms
An effective strategy for both classification and regression is to weight the contributions of the neighbors, so that the closer neighbors contribute more to the average than the farther ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. When KNN is used for classification, the output can be calculated as the class with the highest frequency among the K most similar instances. Each instance in essence votes for its class, and the class with the most votes is taken as the prediction. If you have an even number of classes (e.g. 2), it is a good idea to choose an odd value of K to avoid a tie; inversely, use an even number for K when you have an odd number of classes. Expanding K by one and looking at the class of the next most similar instance in the training dataset can also reliably break ties.
Our forecasts become less stable as we reduce the value of K towards one. Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an increasing number of errors; it is at this point that we know we have pushed the value of K too far. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.
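A sketch of the distance-weighted variant described above, where each neighbour votes with weight 1/d (scikit-learn's weights="distance"), using K = 5, an odd value, per the tiebreak advice; the dataset and split are stand-ins:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights="distance" makes closer neighbours contribute more to the vote.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))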
Advantages:
Disadvantages:
The algorithm gets significantly slower as the number of examples and predictor variables increases.
NAIVE BAYES
Naive Bayes classifier is based on Bayes’ theorem and is one of the oldest approaches
The goal is to figure out how likely something is to happen. A occurs when B
occurs. The naive Bayes classifier combines Bayes’ model with decision rules like the
hypothesis which is the most probable outcome. The "naive" assumption of conditional
independence between any pair of features given the value of the class variable
underpins a collection of supervised learning algorithms known as nnaive
aive Bayes
methods. It was created for text categorization problems and is still used as a
benchmark today.
Pros:
Predicting the test data set's class is simple and quick. It's also good at multi-class prediction.
Real-time Prediction: Naive Bayes is an eager learning classifier and it is super fast. Thus, it could be used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable(s). The normal distribution is assumed for numerical variables (bell curve, which is a strong assumption).
Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers, mostly used in text classification (due to better results in multi-class problems and the independence rule), have a higher success rate compared to other algorithms. As a result, they are commonly utilised in spam filtering and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
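A sketch of a Gaussian naive Bayes classifier with scikit-learn, which encodes the normal-distribution assumption noted above and returns per-class probabilities (dataset and split are stand-ins):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()                    # assumes each numeric feature is normal per class
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test)[:3])  # probability of each class; works for multi-class too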
Stochastic Gradient Descent (SGD):
For the parameters, choose a random starting value. (To be clear, differentiate "y" from "x" in the parabolic example.) If there are more characteristics such as x1, x2, and so on, we compute the partial derivative of "y" with regard to each one.
Fig 4.4 SGD Graph
For each feature: step size = gradient * learning rate.
The new parameters are calculated as follows: new params = old params - step size.
Mini-batch Gradient Descent: the steps taken towards the loss function minima oscillate as a result of frequent updates, which can help escape local minima (in case the computed position turns out to be a local minimum).
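A minimal sketch of the update rule above in NumPy (the gradient values are hypothetical placeholders):

import numpy as np

def sgd_step(params, gradients, learning_rate=0.01):
    # For each feature: step size = gradient * learning rate,
    # then new params = old params - step size.
    return params - learning_rate * gradients

params = np.random.randn(3)          # random starting values for the parameters
grads = np.array([0.5, -1.2, 0.3])   # hypothetical partial derivatives
params = sgd_step(params, grads)
print(params)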
3. SVM
SVM stands for Support Vector Machine and is one of the most widely used
Supervised Learning algorithms for Classification and Regression issues. However, it is
mostly used in Machine Learning to solve classification problems.
The goal of the SVM method is to discover the best line or decision boundary for
categorising n-dimensional space into classes so that subsequent data points can
be easily placed in the right category. The ideal decision boundary is known as a hyperplane. SVM is used to select the extreme points/vectors that help build the hyperplane. Formally, the SVM algorithm produces a hyperplane that divides the data into two classes with the maximum margin, meaning that the distance between the hyperplane and the nearest examples (the margin) is maximised; SVMs can perform non-linear classification tasks by using a non-linear kernel. The algorithm is known as a Support Vector Machine, and support vectors are the extreme examples. Consider the picture
below, which shows how two distinct categories are identified using a decision
boundary.
Fig 4.5 SVM Linear
To begin implementing kernel functions, use the command prompt terminal to install
the "scikit-learn" library.
The Gaussian Kernel is used to convert data when no prior knowledge of the data
exists.
The Gaussian Kernel Radial Basis Function (RBF) is the same as the preceding kernel
function, but with the addition of the radial basis approach to improve the
transformation.
Polynomial Kernel: A polynomial kernel shows the similarity of vectors in the training
set of data in a feature space over polynomials of the original variables utilised in the
kernel. The library focuses on data modelling. It is not designed to load, manipulate, or
summarise data. Refer to NumPy and Pandas for these functionalities. In Python,
Scikit-learn (Sklearn) is the most usable and robust machine learning library. It uses a
Python consistency interface to deliver a set of efficient machine learning and statistical
modelling capabilities, such as classification, regression, clustering, and dimensionality
reduction.
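A sketch comparing the kernels discussed above on one split, assuming scikit-learn ("rbf" is the Gaussian radial basis function kernel; dataset and split are stand-ins):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf", "poly"):    # linear, Gaussian RBF, polynomial
    model = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(kernel, "test accuracy:", model.score(X_test, y_test))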
CHAPTER 5
SYSTEM DESIGN
In the previous chapter the challenges are discussed that are faced during data analysis
and the potential solutions. Now in this chapter the process followed to design the
system is being discussed. It defines the view of the project as seen on the levels of
architecture, modules, data and the user.
[Figure: system flow: Cancer dataset → Pre-processing → Feature Selection → Data Partition → Classification → Cancerous / Non-Cancerous]
5.2 Structural and Dynamic Modeling of the System
5.2.1 Use Case Diagram
Use case diagram basically describes the high-level functions and scope of a system. It
also identifies the interactions between the systems and its actors. The use cases and
actors in the use case diagram describe what the system does and how the actors use it,
but not how the system operates internally.
5.2.2 State Chart / Activity Diagram
5.2.3 Component /Deployment Diagram
CHAPTER 6
IMPLEMENTATION
SVM: SVM stands for Support Vector Machine and is one of the most widely used
Supervised Learning algorithms for Classification and Regression issues. However, it is
mostly used in Machine Learning to solve classification problems.
The goal of the SVM method is to discover the best line or decision boundary for
categorising n-dimensional space into classes so that subsequent data points can be
easily placed in the right category. The ideal decision boundary is known as a
hyperplane. SVM is used to select the extreme points/vectors that help build the
hyperplane. The algorithm is known as a Support Vector Machine, and support vectors
are the extreme examples. Consider the picture below, which shows how two distinct
categories are identified using a decision boundary.
KNN: When KNN is used for classification, the output can be calculated as the class with the highest frequency from the K most similar instances. In essence, each instance votes for its class, and the class with the most votes is deemed the winner.
If you're working with an even number of classes (e.g. 2), it is a good idea to choose an odd value of K to avoid a tie; inversely, use an even number for K when you have an odd number of classes.
SGD: SGD (Stochastic Gradient Descent) is a fast and simple method for fitting
linear classifiers and regressors to convex loss functions such as (linear) Support Vector
Machines and Logistic Regression. Despite the fact that SGD has been present for a
long time in the machine learning field, it has only lately gotten a lot of attention in the
context of large-scale learning. SGD has been used to solve large-scale, sparse machine
learning issues that are common in text categorization and natural language processing.
Because the data is sparse, the classifiers in this module can easily scale to problems with more than 10^5 training samples and more than 10^5 features.
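A sketch of such an SGD-trained linear classifier with scikit-learn's SGDClassifier; the scaling step is an assumption of this sketch, since SGD is sensitive to feature scales, and the dataset is a stand-in:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# loss="hinge" fits a linear SVM by SGD; "log_loss" would give logistic regression.
sgd = make_pipeline(StandardScaler(),
                    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0))
sgd.fit(X_train, y_train)
print("SGD test accuracy:", sgd.score(X_test, y_test))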
6.1.2 Software tools used
Anaconda is useful because it comes with Python and roughly 200 more Python packages, all free to use. Many of the most common Python packages for solving problems are included in Anaconda's packages. Anaconda is a Python distribution that includes The Standard Library as well as 200 more packages. When you download Python from Python.org, Python and The Standard Library are the only modules available. You could install the extra modules that Anaconda bundles (which aren't included with plain old Python) yourself, but why not save a step (or 200) and download one thing (Anaconda) instead of 201 things (Python plus the 200 extras) [18].
Additional Libraries
Python-based toolkit
Installation on Windows:
Tensor Flow
TensorFlow is a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a particular focus on
training and inference of deep neural networks.
Installation on Windows:
Matplotlib:
Matplotlib is a plotting library for the Python programming language. It is used to create static, animated, and interactive visualizations, including the confusion matrix and ROC curve plots shown later in this report.
Numpy:
NumPy is a Python package that allows you to interact with arrays. It also has matrices,
Fourier transforms, and linear algebra functions.
NumPy was created by Travis Oliphant in 2005. You are free to use it because it is
an open source project. Unlike lists, NumPy arrays are kept in a single continuous
area in memory, making it easy for programmes to access and modify them.
This is known as locality of reference in computer science. This is the main reason
why NumPy performs better than lists. It's also been updated to support the most
recent CPU architectures.
Seaborn:
6.2 Dataset Description
Unzip the compressed data files and store them in the format mentioned below:
import data_utils
data_utils.download_files()
The following file is corrupted and gives an error when being loaded. Delete it before
proceeding.
6.2.1 Size (No. of Samples) and description of attributes
c) area
d) perimeter
The output of SGD code also depicts a couple of graphical results:
Fig 6.3 The output specifications obtained on compiling the SGD code
Fig 6.4 The output specifications obtained on compiling the SVM (Linear) code
The output of SVM (linear) also depicts two graphs, one of the confusion matrix and the other of the ROC curve. The accuracy of SVM (linear) is comparatively higher than that of SGD.
Fig 6.5 The output specifications obtained on compiling the SVM (Gaussian) code
The output of SVM (Gaussian) also depicts two graphs, one of the confusion matrix and the other of the ROC curve. The accuracy of SVM (Gaussian) is comparatively higher than that of SGD and SVM (linear).
0 (zero) in the figure 6.6 represents ‘B’ that is Benign and 1 (one) represents ‘M’ that
is malignant.
CHAPTER 7
RESULT ANALYSIS
Metrics:
Once the model has been trained on the training data, its performance will be evaluated using the test data.
Accuracy - will be used for evaluating the performance of the model on the test
data.
Confusion Matrix - will be used in order to compare the model with the Benchmark
model.
A classification model's performance is described using a confusion matrix.
Receiver operating characteristic curve or the ROC curve is a graph showing the
performance
mance of a classification model at all classification threshold. The accuracy of a
ROC curve in the SGD algorithm is 94.15%. The precision of the SGD is 97% and the
area under the curve is approximately 97% which means it can be considered as a
potential algorithm
lgorithm for disease prediction.
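A minimal sketch (again with assumed names) of computing the ROC curve and the area under it for such a classifier:

from sklearn.metrics import roc_curve, roc_auc_score

scores = model.decision_function(X_test)  # signed distance from the separating hyperplane
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))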
A confusion matrix is a table that is used to describe the performance of a
classification algorithm. The confusion matrix of the SGD algorithm is inclined towards
the positive-prediction side, thereby making SGD a good algorithm for classification
and prediction.
The accuracy of SVM (linear) is 96.49% and the area under its ROC curve is
approximately 97%. SVM with a linear kernel is therefore considered a better algorithm
than SGD.
The confusion matrix of the SVM algorithm with a linear kernel is inclined mostly
towards the positive-prediction side, thereby making SVM with a linear kernel a strong
algorithm for classification and prediction problems.
The accuracy of SVM (Gaussian) is 97.66% and the area under its ROC curve is
approximately 98%. SVM with a Gaussian kernel is therefore considered a better
algorithm than both SGD and SVM with a linear kernel.
The confusion matrix of the SVM algorithm with a Gaussian kernel is likewise inclined
mostly towards the positive-prediction side, thereby making SVM with a Gaussian kernel
a strong algorithm for classification and prediction problems.
CHAPTER 8
CONCLUSION AND FUTURE SCOPE
8.1 Conclusion
This work makes it easier to perform cancer cell detection in a time-saving, efficient
and cost-effective way. The model also shows how complex calculations can be avoided
while still analysing cancer detection efficiently.
8.2 Limitation
The model is able to classify cells as cancerous or non-cancerous, but its accuracy
could be improved if a larger dataset were fed into it. Preprocessing of the data may
demand high system resources and can result in memory errors, so a better-equipped
system is preferred. We have also not optimised the model with cross-validation, which
may further improve detection.
8.3 Future Scope
In hospitals, an app-based user interface could allow clinicians to quickly assess the
impact of a tumour and make treatment recommendations. Since the performance and
complexity of ConvNets depend on the input data representation, we can also try to
forecast the location and stage of a tumour from volume-based 3D images. Building
three-dimensional (3D) anatomical models of specific patients improves training,
planning, and computer guidance during surgery. Because the algorithms are still
evolving, there is a good probability that cancer cells will be detected at ever
earlier stages, and with the number of cancer patients increasing every day, cancer
cell identification has a lot of potential.
REFERENCES
[1] Kumar, Y., Gupta, S., Singla, R. et al., "A Systematic Review of Artificial
Intelligence Techniques in Cancer Prediction and Diagnosis", Archives of Computational
Methods in Engineering (2021).
[2] Habib Dhahri, Eslam Al Maghayreh, Awais Mahmood, Wail Elkilani, Mohammed Faisal
Nagi, "Automated Breast Cancer Diagnosis Based on Machine Learning Algorithms",
Journal of Healthcare Engineering, vol. 2019, Article ID 4253641, 11 pages, 2019.
[3] Cruz JA, Wishart DS., "Applications of Machine Learning in Cancer Prediction and
Prognosis", Cancer Informatics, January 2006, doi:10.1177/117693510600200030.
[6] Anji Reddy Vaka, Badal Soni, Sudheer Reddy K., "Breast cancer detection by
leveraging Machine Learning", ICT Express, Volume 6, Issue 4, 2020, pp. 320-324,
ISSN 2405-9595.
[7] Deepika S and Kapilaa Ramanathan Devi N (2021), "Prediction of Breast Cancer Using
SVM Algorithm", ISSN 0973-4562, Volume 16, Number 4 (2021), pp. 316-320.
[8] Cruz JA, Wishart DS., "Applications of machine learning in cancer prediction and
prognosis", Cancer Inform., 2007;2:59-77, published 11 Feb. 2007.
[9] Cruz, Joseph A, and David S Wishart, "Applications of machine learning in cancer
prediction and prognosis", Cancer Informatics, vol. 2, pp. 59-77, 11 Feb. 2007.
[10] Cruz JA, Wishart DS., "Applications of machine learning in cancer prediction and
prognosis", Cancer Inform., 2007 Feb 11;2:59-77, PMID: 19458758; PMCID: PMC2675494.
[11] Barnali Sahu, Debahuti Mishra, "A Novel Feature Selection Algorithm using Particle
Swarm Optimization for Cancer Microarray Data", International Conference on Modeling
Optimization and Computing (ICMOC-2012), Elsevier Procedia Engineering 38 (2012),
pp. 27-31.
[12] … Conference of the IEEE Engineering in Medicine and Biology Society (EMBC),
Orlando, FL, USA, 2016, pp. 2440-2443.
[13] Cuong Nguyen, Yong Wang, Ha Nam Nguyen, "Random forest classifier combined with
feature selection for breast cancer diagnosis and prognostic", J. Biomedical Science
and Engineering, 2013, 6, pp. 551-560.
[16] Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M, Razavi AR, "Using Three Machine
Learning Techniques for Predicting Breast Cancer Recurrence", Open Access, Journal of
Health & Medical Informatics, 2013, vol. 4, issue 2, ISSN: 2157-7420,
http://dx.doi.org/10.4172/2157-7420.1000124.
[18] Mehdi Pirooznia, Jack Y Yang, Mary Qu Yang, Youping Deng, "A comparative study of
different machine learning methods on microarray gene expression data", BMC Genomics
(Open Access, BioMed Central), 2008; International Conference on Bioinformatics &
Computational Biology (BIOCOMP'07), Las Vegas, NV, USA, 25-28 June 2007,
DOI: 10.1186/1471-2164-9-S1-S13.
[20] Vikas Chaurasia, Saurabh Pal, "Data Mining Techniques: To Predict and Resolve
Breast Cancer Survivability", International Journal of Computer Science and Mobile
Computing, Vol. 3, Issue 1, January 2014, pp. 10-22, ISSN 2320-088X.
[24] Kar, Subhajit, Kaushik Das Sharma, and Madhubanti Maitra, "A particle swarm
optimization based gene identification technique for classification of cancer
subgroups", in 2nd IEEE International Conference on Control, Instrumentation, Energy
and Communication (CIEC), 2016.
[25] Hiba Asri, Hajar Mousannif, Hassan Moatassime, Thomas Noel, "Using Machine
Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis", Elsevier 6th
International Symposium on Frontiers in Ambient and Mobile Systems (FAMS 2016),
Procedia Computer Science 83 (2016), pp. 1064-1069.
LIST OF PUBLICATIONS
[I] Akash Pandey, Ankit Kumar Singh and Nitin Kumar Yadav (2022), "Cancer Cell
Detection Using Machine Learning", accepted in the 4th IEEE ICAC3N-22.
CONTRIBUTION OF PROJECT
Cancer cell detection is one of the most important and practical problems in
applications that implement pattern recognition of cancerous cells. Here, SVM is used
to recognise infected cells and classify them against the classes pre-defined in the
dataset. This has various key applications, such as disease prediction and the
diagnosis of cancerous cells. Because it builds on globally renowned machine learning
algorithms, the rapid use of this technology can help solve many complex real-world
problems in real time.
So the objective of this project is to create a cancer cell detection model that is
accurate and robust enough to classify benign cancer cells efficiently in real time.
Expected Outcome
The expected outcome of this project is a model application with a simple user
interface built using the standard Tkinter Python library. The model is trained on a
labelled dataset to classify benign and malignant cancer cells in real time, with
several functionalities bound to user events.
The proposed machine learning model will use the breast cancer dataset described in
Chapter 6, and the data will be preprocessed before being fed into the model to provide
better accuracy.
Cancer cell detection is one of the significant areas of research and development, with
a growing number of possibilities that could be attained.
Cancer cell detection using machine learning is socially relevant because it has
numerous applications, such as disease prediction and the diagnosis of cancerous cells.
It helps society by providing real-time cancer prediction for various
specific applications. On the other hand, implementing this technology on a large scale
could bring about several repercussions.
Cancer cell detection using machine learning doesn't necessarily pose any health
concerns. The basic aim behind this project is to formulate and use a machine-learning
algorithm to automate the process of cancer prediction. There is no direct health
concern related to this project. It may even have a positive indirect impact on a
person's mental health, since using this model provides a sense of satisfaction and may
improve one's efficiency, ultimately saving time and manpower.
Our cancer cell detection project uses a freely available dataset. It uses Python as
its core programming language, which is a free and open-source language. The libraries
used for this project (TensorFlow, Keras, Tkinter) are also free and open source.
Hence, this project is completely legal, and a risk management plan will be developed
to tackle the obstacles faced during its completion. Also, several laws may need to be
implemented by the government, alongside spreading awareness, before cancer cell
detection is adopted worldwide, since it could be misused by malicious actors.