
A

Project Report
on
Cancer Cell Detection using Machine Learning

Submitted in partial fulfillment of the requirements


for the award of the degree of

Bachelor of Technology
in
Computer Science and Engineering
by
Akash Pandey (1809710010)
Ankit Kumar Singh (1809710015)
Nitin Kumar Yadav (1809710064)

Under the Supervision of


Ms. Tanu Shree

Galgotias College of Engineering & Technology


Greater Noida, Uttar Pradesh
India-201306
Affiliated to

Dr. A.P.J. Abdul Kalam Technical University


Lucknow, Uttar Pradesh,
India-226031
May, 2022
GALGOTIAS COLLEGE OF ENGINEERING & TECHNOLOGY
GREATER NOIDA, UTTAR PRADESH, INDIA-201306.

CERTIFICATE

This is to certify that the project report entitled “CANCER CELL DETECTION USING
MACHINE LEARNING” submitted by Mr. Akash Pandey (1809710010), Mr. Ankit
Kumar Singh (1809710015), and Mr. Nitin Kumar Yadav (1809710064) to the Galgotias
College of Engineering & Technology, Greater Noida, Uttar Pradesh, affiliated to Dr. A.P.J.
Abdul Kalam Technical University, Lucknow, Uttar Pradesh, in partial fulfillment for the
award of the Degree of Bachelor of Technology in Computer Science & Engineering is a
bonafide record of the project work carried out by them under my supervision during the year
2021-2022.

Ms. Tanu Shree (Project Guide) Dr. Vishnu Sharma


Assistant Professor Professor and Head
Dept. of CSE Dept. of CSE

i
GALGOTIAS COLLEGE OF ENGINEERING & TECHNOLOGY
GREATER NOIDA, UTTAR PRADESH, INDIA-201306.

ACKNOWLEDGEMENT
We have put great effort into this project. However, it would not have been possible
without the kind support and help of many individuals and organizations. We would
like to extend our sincere thanks to all of them.

We are highly indebted to Ms. Tanu Shree for her guidance and constant
supervision, as well as for providing the necessary information regarding the
project and for her support in completing it.

We are extremely indebted to Dr. Vishnu Sharma, HOD, Department of Computer
Science and Engineering, GCET, and Dr. Jaya Sinha, Project Coordinator,
Department of Computer Science and Engineering, GCET, for their valuable
suggestions and constant support throughout our project tenure. We would also like
to express our sincere thanks to all faculty and staff members of the Department of
Computer Science and Engineering, GCET, for their support in completing this
project on time.

We also express gratitude towards our parents for their kind co-operation and
encouragement, which helped us in the completion of this project. Our thanks and
appreciation also go to our friends who helped in developing the project and to all
the people who have willingly helped us out with their abilities.

Akash Pandey
Ankit Kumar Singh
Nitin Kumar Yadav

ii
ABSTRACT

Cancer has been characterized as a heterogeneous disease consisting of many
different subtypes. The early diagnosis and prognosis of a cancer type have become
a necessity in cancer research, as they can facilitate the subsequent clinical management
of patients. The importance of classifying cancer patients into high- or low-risk groups
has led many research teams, from the biomedical and the bioinformatics fields, to
study the application of machine learning (ML) methods. These techniques
have therefore been utilized with the aim of modeling the progression and treatment of cancerous
conditions. In addition, the ability of ML tools to detect key features from complex
datasets reveals their importance. A variety of these techniques, including Artificial
Neural Networks (ANNs), Bayesian Networks (BNs), Support Vector Machines
(SVMs), and Decision Trees (DTs), have been widely applied in cancer research for
the development of predictive models, resulting in effective and accurate decision
making. Even though it is evident that the use of ML methods can improve our
understanding of cancer progression, an appropriate level of validation is needed for
these methods to be considered in everyday clinical practice. In this work, we present
a review of recent ML approaches employed in the modeling of cancer progression.
The predictive models discussed here are based on various supervised ML techniques
as well as on different input features and data samples. Given the growing trend
in the application of ML methods in cancer research, we present here the most recent
publications that employ these techniques to model cancer risk or patient
outcomes.

KEYWORDS: Machine Learning, biomedical, KNN, prediction, application, dataset, SVM

iii
CONTENTS

Title Page

CERTIFICATE i
ACKNOWLEDGEMENT ii
ABSTRACT iii
CONTENTS iv
LIST OF TABLES v
LIST OF FIGURES vi
NOMENCLATURE vii

ABBREVIATIONS viii

CHAPTER 1: INTRODUCTION

1.1 Motivation 1

1.2 Description of Theoretical concepts 2

CHAPTER 2: LITERATURE REVIEW

2.1 Related Literature Review 14

CHAPTER 3: PROBLEM FORMULATION

3.1 Description of problem domain 24


3.2 Problem statement 25
3.3 Description of problem statement 26
3.4 Objective 26

CHAPTER 4: PROPOSED WORK

4.1 Introduction 27
4.2 Proposed methodology 30
4.3 Description of each step 30

iv
CHAPTER 5: SYSTEM DESIGN

5.1 Functional specification of system 39


5.2.1 Use case diagram 40
5.2.2 State diagram 41
5.2.3 Component diagram 42

CHAPTER 6: IMPLEMENTATION

6.1 Experimental setup 43


6.2 Dataset description 47

CHAPTER 7: RESULT ANALYSIS

7.1 Performance measures 60

7.2 Performance analysis 64

CHAPTER 8: CONCLUSION, LIMITATION, AND FUTURE SCOPE 65

REFERENCE 66

LIST OF PUBLICATIONS 69

CONTRIBUTION OF PROJECT 70

v
List of Tables

Table Title Page

3.1 Description of features used in the dataset 22

vi
LIST OF FIGURES

Figure Title Page

1.1 Classification Problems 3

1.2 ML Architecture 6

1.3 ML Models 7

2.1 Graphical representation of different cancers 8

2.2 Comparison of Classifier Accuracy 11

3.1 Data Flow Diagram 19

4.1 Flow diagram of Proposed work 28

4.2 Cancerous & Non-Cancerous 29

4.3 Pictorial representation of K-Nearest Algorithm 29

4.4 SGD Graph 29

4.5 SVM linear 30

5.1 Function specification of system 39

5.2 Use-case Diagram 40

5.3 Activity Diagram 41

5.4 Component Diagram 42

6.1 Dataset 47

6.2 Features of Dataset 56

7.1 Confusion matrix for dataset 61

vii
NOMENCLATURE
English Symbols

A pre-exponential constant

D Density

cos trigonometric function

viii
ABBREVIATIONS

SVM Support Vector Machine


CNN Convolutional Neural Networks
KNN K-Nearest Neighbour
SGD Stochastic Gradient Descent

ix
CHAPTER 1

INTRODUCTION

Cancer is a disease that occurs when some changes or mutations take place in genes
that help in cell growth. These mutations allow the cells to divide and multiply in a very
uncontrolled and chaotic manner. These cells keep increasing and start making replicas
which end up becoming more and more abnormal. These aberrant cells eventually grow
into a tumour. Tumour cells, unlike normal cells, do not die when the body no longer
requires them. Cancer is a disease that arises in cells; it can also occur in the
fatty tissue or the fibrous connective tissue within the body. These cancer cells
become uncontrollable and end up invading other healthy tissues, and can travel to the
lymph nodes under the arms. Tumours are of two types: malignant and benign.

1.1 Motivation

Malignant tumours are cancerous. These cells keep dividing uncontrollably and start
affecting other cells and tissues in the body. They spread to all other parts of the body,
and it is hard to cure this type of cancer. Chemotherapy, radiation therapy, and
immunotherapy are types of treatments that can be given for these types of tumours.
Benign tumours are non-cancerous. Unlike malignant tumours, they do not spread to other
parts of the body and hence are much less risky. In many cases, such
tumours don’t require any treatment. Cancer is most commonly diagnosed in people
aged above 40, but the disease can affect men and women of any age. It can also occur
when there is a family history of cancer. Cancer has always had a high mortality rate,
and according to statistics, breast cancer alone accounts for about 25% of all new cancer diagnoses
and 15% of all cancer deaths among women worldwide. Scientists have known about its
dangers from very early on, and hence a lot of research has been put into
finding the right treatment for it.

Cancer is a disease that we hear about a lot nowadays; it is one of the most
widespread diseases. There are around 2,000+ new cases of breast cancer in men each year
and about 230,000 new cases in women every year. A correct and early
diagnosis is therefore essential.

The main objective of this project is to help doctors analyze the huge datasets of cancer
1
data and find patterns in the patient’s data. With
this analysis, we can predict whether the patient might have cancer or not.
This dataset analysis will be aided by machine learning algorithms, and these techniques
will be used to predict the outcome: the cancer is either benign, or malignant, in which
case the cancer cells spread across the body, making the disease very dangerous. This
prediction can help doctors prescribe different medical examinations for the patients
based on the cancer type. The patient will save a lot of time and money as a result of this.

1.2 Theoretical Concepts

Machine Learning: Machine Learning is a subcategory of Artificial Intelligence that


allows systems to automatically learn and understand data from experience without
being explicitly programmed to do so. It helps software applications become better at
predicting outcomes for various types of problems. The basic idea of ML is to take in
input data and use different algorithms to help predict outcomes, and also to update these
outcomes when new data is available as input. The procedures used in machine
learning are similar to those of data mining and predictive modeling: both require scanning
through huge amounts of data to search for any type of pattern in the data and then
modifying the program accordingly. Many individuals have seen machine learning in
action while shopping on the internet, when they are shown ads based on what they
were searching for earlier. This happens because many of these websites use
machine learning to customize the ads based on user searches, in real time.
Machine learning has also been used in various other areas such as detecting fraud,
filtering spam, network security threat detection, predictive maintenance, and building
news feeds.

Machine Learning methods:

Supervised learning – Here both the input and output are known. The training dataset
also contains the answer the algorithm should come up with on its own. So, a labeled
data set of fruit images would tell the model which photos were of apples, bananas, and
oranges. When a new image is given to the model, it compares it to the training set to
predict the correct outcome. Supervised learning algorithms build
mathematical models from sets of data that contain both the inputs and the desired
outputs. The data is called training data and contains sets of training examples. Through
2
repeated optimization of an objective function, supervised learning algorithms learn
the function, which is then used to predict the result associated with new inputs.
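
To make this concrete, here is a minimal illustrative sketch (not the project's own code) of supervised learning, assuming scikit-learn is installed and using its bundled Wisconsin breast-cancer dataset as the labeled training data:

# Minimal supervised-learning sketch: fit a classifier on labeled data,
# then predict labels for unseen inputs (illustrative stand-in only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)            # inputs and known outputs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000)             # iteratively optimizes a loss
model.fit(X_train, y_train)                           # learn the input-output mapping
print("Test accuracy:", model.score(X_test, y_test))  # behaviour on new inputs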

Unsupervised learning – Here the input dataset is known but the output is not. A
model is given a dataset without any instructions on what to do with it.
The training data contains information without any correct result. The network tries to
automatically understand the structure of the data.
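
As an illustrative sketch (again assuming scikit-learn, not the project code), k-means clustering can group the same data without ever seeing the labels:

# Minimal unsupervised-learning sketch: no labels are given, and k-means
# tries to discover structure in the data on its own.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = load_breast_cancer(return_X_y=True)   # the labels are deliberately ignored
X = StandardScaler().fit_transform(X)        # clustering is sensitive to feature scale

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", [(kmeans.labels_ == k).sum() for k in (0, 1)])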

Semi-supervised learning – This sort of learning is a hybrid of supervised and
unsupervised learning. It includes both labeled and unlabeled data.

Reinforcement learning – In this type, AI agents try to find the best way to
accomplish a particular goal. The agent tries to predict the next step that could give
it the best result at the end.

Examples of Supervised Learning:

Classification method: Classification methods are utilized when the output variable is
categorical, for example with two classes such as Yes-No, Male-Female, or
True-False.

Below are some popular classification algorithms and applications –

 Spam Filtering

 Random Forest

 Decision Trees

 Logistic Regression

 SVM

This learning function, as previously said, classifies the data item into one of several
predetermined classes. Training and generalization mistakes can occur when ML
techniques are used to create a classification model. The former refers to training data
misclassification mistakes, while the latter pertains to predicted testing data errors. A
sufficiently large dataset that can be partitioned into disjoint training and test
sets, or subjected to some reasonable form of n-fold cross-validation for smaller
datasets, is a basic requirement for any ML technique.
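
A minimal sketch of such n-fold cross-validation, assuming scikit-learn and using its bundled breast-cancer dataset as a stand-in:

# 10-fold cross-validation: the data is split into 10 disjoint folds, and each
# fold serves once as the held-out test set (illustrative sketch only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(model, X, y, cv=10)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))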

3
Fig 1.1 Classification Problems [1]

Regression: Regression procedures are applied when there is a relationship between the
input and output variables, typically with a continuous output. Regression has numerous
uses, including weather forecasting, market trend analysis, and so on (a short sketch
follows the list of algorithms below).

Here are some popular supervised learning regression algorithms:

 Linear Regression

 Regression Trees

 Non-Linear Regression

 Bayesian Linear Regression
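
The sketch below is a hypothetical example, assuming scikit-learn and NumPy; it fits a linear regression to synthetic data and recovers the slope and intercept:

# Minimal regression sketch: fit y = a*x + b to noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)        # a single input feature
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 0.5, 10)   # true slope 3, intercept 2

reg = LinearRegression().fit(X, y)
print("slope=%.2f intercept=%.2f" % (reg.coef_[0], reg.intercept_))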

Advantages of Supervised learning:

 With the help of supervised learning, the model can predict the output based on
prior experiences.

 We can have a precise idea about the classes of objects in supervised
learning.

 Supervised learning models help us solve various real-world
problems such as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

 Supervised learning models are not suitable for handling complex tasks.

4
 If the test data differs from the training dataset, supervised learning
cannot predict the proper output.

 Training requires a lot of computation time.

Association: An association rule is a type of unsupervised learning strategy for
discovering associations between variables in a large database. It identifies groups of
elements in the dataset that occur together. Association rules improve the
effectiveness of marketing strategies: people who buy item X (such as bread) are
more likely to also buy item Y (butter or jam). Market Basket Analysis is a good example
of the association rule; a tiny worked example follows below.
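
The following plain-Python sketch (the transactions are made up purely for illustration) computes the two core association-rule measures, support and confidence:

# Support = fraction of transactions containing the itemset;
# confidence = support(X and Y) / support(X) for the rule X -> Y.
transactions = [
    {"bread", "butter"}, {"bread", "jam"}, {"bread", "butter", "jam"},
    {"milk"}, {"bread", "butter"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# Rule: people who buy bread are likely to also buy butter.
print("support:", support({"bread", "butter"}))          # 0.6
print("confidence:", confidence({"bread"}, {"butter"}))  # 0.75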

FEATURES OF MACHINE LEARNING:


Machine learning is essentially automation: getting computers to program themselves,
since writing software by hand is the bottleneck. Machine learning models involve machines
learning from data without human intervention. As a data mining technique, it transforms
raw data into an understandable format.

Fig 1.2 Testing dataset accuracy [2]

Feature engineering is the time-consuming and expensive process of modifying data to
help machine learning algorithms operate better. The current method is divided into
steps. The breast cancer dataset was used in three distinct experiments. In the first, we
demonstrated that, with effective configuration, the three most popular evolutionary
algorithms can attain the same performance. The second experiment looked at how
combining different feature selection approaches enhances accuracy. Finally, in the

5
last experiment, we discovered how to create a machine learning supervised classifier
automatically. We used the GP algorithm to address the hyperparameter problem,
which is a difficult problem for machine learning algorithms. From among the many
configurations, the proposed algorithm chose the most appropriate one.

Applications of Machine Learning:


 Speech recognition, often known as "Speech to text" or "Computer speech
recognition," is the process of turning voice instructions into text. Machine
learning techniques are now widely used in a variety of speech recognition
applications. Speech recognition technology is used by Google assistants,
Siri, Cortana, and Alexa to follow voice commands.
 One of the most common uses of machine learning is image recognition.
It's used to identify things like people, places, and digital photographs.
Automatic buddy tagging suggestion is a common use of picture recognition
and facial identification.
 Machine learning is widely used by various e-commerce and entertainment
companies such as Amazon, Netflix, etc., for product recommendations to the
user. Because of machine learning, whenever we look for a product on
Amazon, we begin to receive advertisements for the same goods while browsing
the internet on the same browser.
 In medical science, machine learning is used for disease diagnosis. As a
result, medical technology is rapidly evolving, and 3D models that can predict
the exact location of lesions in the brain are now possible. It facilitates the
detection of brain cancers and other brain-related illnesses.
 A sequence-to-sequence learning technique, which is employed with picture
recognition and translates text from one language to another, is behind
the automatic translation.
 Sentiment analysis is a real-time machine learning programme that detects the
speaker's or writer's sentiment or opinion. For example, if someone writes a
review or an email (or any other type of document), a sentiment analyzer will
instantaneously determine the text's true meaning and tone. This sentiment
research app can be used to examine a review-based website, decision-making

6
apps, and more. Machine learning algorithms are being actively implemented by
companies to identify the level of access individuals require in various places
based on their job profiles.

Fig 1.3 ML Architecture [2]

Advantages of ML:

 It predicts problems quickly and in real time.

 Efficiently utilizes the resources.

 Helps in the automation of different tasks.

 It is widely employed in a variety of fields, including business, medicine, and sports.

 Helps interpret previous behavior of the model

The fundamental goal of machine learning approaches is to create a model that can be
used for classification, prediction, estimation, and other tasks. The most common task
in the learning process is classification. This learning function, as previously said,
classifies the data item into one of several predetermined classes. Training and
generalization mistakes can occur when ML techniques are used to create a
classification model. The former refers to training data misclassification mistakes,
while the latter pertains to predicted testing data errors. A good classification model
should be able to accurately categorize all of the examples in the training set.

Over the last decade, interest in machine learning (ML) for healthcare has exploded.

7
Though machine learning has been an academic discipline since the mid-twentieth
century, advances in computing capabilities, data availability, creative approaches, and
a growing pool of technical talent have hastened its application in healthcare. The
academic and general press have focused much of their emphasis on applications of
machine learning in healthcare delivery; however, uses of machine learning in clinical
research are less frequently highlighted (Fig. 1). Clinical research is a broad subject,
with preclinical research and observational studies leading to traditional trials and
pragmatic trials, which in turn encourage clinical registries and more implementation
research. Clinical research as it is now conducted is complicated, despite its importance
in improving healthcare and outcomes. It is time-consuming, costly, and prone to
unanticipated errors and biases, which might jeopardize its successful application,
implementation, and adoption.

Fig 1.4 Analyzing training dataset [3]

Data collection and administration

The application of machine learning to clinical trials may alter the data collecting,
management, and analysis strategies necessary. ML approaches, on the other hand, can
assist in overcoming some of the challenges connected with missing data and obtaining
real-world data. If the ML model is subverted by purposefully or inadvertently
manipulated sensor data when processing wearable sensor output to extract research
endpoints, the results may be tainted. Other device-related prospects in patient
centricity, aside from the development of novel digital biomarkers, include the capacity
8
to export data and analytics back to participants to facilitate education and insight.
Better defining how previously validated clinical objectives and patient-centric digital
biomarkers overlap, as well as knowing participant attitudes on privacy in relation to
the sharing and use of device data, are all barriers to ML processing of device data
adoption.

Data collection

In prospective trials or retrospective reviews, an intriguing use of ML, specifically


NLP, to study data management is to automate data collecting into case report forms,
reducing the time, expense, and possibility for error associated with human data
extraction. Though overcoming variable data structures and provenances is required for
this application, it has showed early promise in cancer research. ML can fuel risk-based
monitoring approaches to clinical trial surveillance, enabling the prevention and/or
early detection of site failure, fraud, and data discrepancies or incompleteness that
could delay database lock and subsequent analysis, regardless of how data was
gathered. Even when people collect data into case report forms (which are frequently
submitted in PDF format), the appropriateness of the acquired data for result
determination can be checked using a combination of optical character recognition and
natural language processing. Data processing can also benefit from machine learning.
Because, while endpoint adjudication has traditionally been a labor-intensive process,
sorting and classifying events is well within the capabilities of ML, semi-automated
endpoint identification and adjudication has the potential to reduce time, cost, and
complexity when compared to the current approach of manual adjudication of events by
a committee of clinicians. For example, using a mix of optical character recognition and
natural language processing, IQVIA Inc. has described the ability to automatically
process some adverse events connected to medication regimens, albeit this technique
has not been described in peer-reviewed journals. In the realm of cardiovascular
research, efforts have lately been made to standardize results, however not all trials
adhere to these criteria. Most fields have not yet undertaken similar efforts to pool trial
data to aid model training for cardiovascular endpoints. Building an ML-based
framework for cancer prediction and classification while dealing with several data
modalities, i.e. multimodal frameworks, provides a new challenge in the field of cancer
research. The development of integrative prediction models by merging the output of

9
many algorithms is a breakthrough, but it also poses a difficulty in terms of
interpretation and dependability of the models' clinical implications.

Fig 1.5 ML Model [3]

The primary goal of machine learning approaches is to create a model that may be used
for classification, prediction, estimation, or any other job. Classification is the most
prevalent activity in the learning process. This learning function, as previously said,
classifies the data item into one of several predetermined classes. Training and
generalisation mistakes can occur when ML techniques are used to create a
classification model. The former refers to training data misclassification mistakes,
while the latter pertains to predicted testing data errors. A good classification model
should closely match the training set and correctly categorise all cases. It's critical to
measure the classifier's performance after obtaining a classification model using one or
more machine learning approaches. Each proposed model's performance is evaluated in
terms of sensitivity, specificity, accuracy, and area under the curve (AUC). The
proportion of true positives successfully identified by the classifier is known as

10
sensitivity, whereas the proportion of true negatives correctly identified is known as
specificity.
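
These measures can be computed directly from the confusion matrix. A minimal sketch, assuming scikit-learn and using made-up labels purely for illustration:

# Sensitivity, specificity, accuracy, and AUC from a confusion matrix.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]            # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]            # illustrative predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)                 # proportion of true positives found
specificity = tn / (tn + fp)                 # proportion of true negatives found
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)    # 0.75 0.75 0.75
print("AUC:", roc_auc_score(y_true, y_pred)) # here the scores are hard labels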

SVMs are a more contemporary approach to machine learning methods used in cancer
prediction and prognosis. SVMs first map the input vector into a higher-dimensional
feature space and find the hyperplane that divides the data points into two classes. The
marginal distance between the decision hyperplane and the instances closest to the
boundary is maximised. The resulting classifier generalises well and can thus be used to
reliably classify fresh samples. It is worth mentioning that SVMs
can also produce probabilistic outcomes. A hyperplane characterises the
separation even when more features are included. The hyperplane is determined by support
vectors, a subset of the points from the two classes. Formally, the SVM algorithm produces a
hyperplane that divides the data into two classes with the maximum margin, meaning
that the distance between the hyperplane and the nearest examples (the margin) is maximised.
SVMs can perform non-linear classification tasks by using a non-linear kernel, a
mathematical function that transforms the data from a linear feature
space to a non-linear feature space. The performance of an SVM classifier can be
dramatically improved by applying different kernels to different datasets.
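
A minimal linear-kernel SVM sketch, assuming scikit-learn and its bundled breast-cancer dataset; swapping kernel="linear" for e.g. kernel="rbf" gives the non-linear variant described above, and probability=True exposes the probabilistic outputs mentioned:

# Linear-kernel SVM: finds the max-margin separating hyperplane.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
clf.fit(X_train, y_train)                       # fit the max-margin hyperplane
print("accuracy:", clf.score(X_test, y_test))
print("class probabilities:", clf.predict_proba(X_test[:1]))  # probabilistic output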

Data Analysis

Clinical trial data, registries, and clinical practice data are rich sources for hypothesis
creation, risk modeling, and counterfactual simulation, and machine learning is well
suited for these tasks. Unsupervised learning, for example, can discover phenotypic
groupings in real-world data that can be investigated further in clinical trials.
Furthermore, ML has the potential to improve the widely used technique of secondary
trial analysis by more effectively identifying treatment heterogeneity while still offering
some (insufficient) protection against false-positive findings, revealing more viable
areas for future research. Furthermore, machine learning can be utilised to develop risk
predictions in retrospective datasets that can then be confirmed prospectively.
For example, using a random forest model on companion trial data, researchers were
able to better discriminate between patients who would do better or worse
following cardiac resynchronization therapy than with a multivariable logistic
regression. This highlights random
11
forests' capacity to capture interactions between features that are missed by simpler
models. In conclusion, there are numerous effective machine learning approaches to
clinical trial data administration, processing, and analysis, but there are fewer techniques
for enhancing data quality as it is generated and collected. Because data availability and
quality are the foundations of machine learning methodologies, conducting high-quality
trials is critical for higher-level ML processing.
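
As an illustration of this kind of model comparison (a hypothetical stand-in, not the companion-trial analysis itself), a random forest and a logistic regression can be scored side by side, assuming scikit-learn:

# Compare a random forest against a logistic regression baseline via
# 5-fold cross-validation (illustrative data, not trial data).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
for name, model in [("random forest", RandomForestClassifier(random_state=0)),
                    ("logistic regression", LogisticRegression(max_iter=5000))]:
    print(name, cross_val_score(model, X, y, cv=5).mean())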

Prognosis and survival prediction for cancer

Another key area of cancer research where AI is expected to make a substantial


contribution is in the care of cancer patients. In particular, studies in this area try to
assess patient prognosis, i.e. forecast approximate survival based on a set of variables
(clinical, imaging, and genomic), evaluate treatment response, and therefore patient
prognosis. Without the use of ML algorithms, particularly DL approaches, such
assessments would be unattainable due to the volume and complexity of the data.
Specifically, almost 200 studies aimed at determining cancer prognosis were
published in the last year alone. The vast majority of them used deep learning
approaches, with only a small number using typical machine learning algorithms. In the
digital world, understanding the mechanisms and reasoning of an ML system could
ensure its dependability. Introducing standardized approaches to assess the robustness
of predictive models in relation to the data used for training, promoting model
transparency through the explainability-by-design principle for ML-based systems, and
designing methodologies to address vulnerabilities, thereby ensuring reliability, will
promote the effective and secure use of ML systems. Furthermore, the effective
implementation of good practices for the proper development and deployment of
automated machine-learning-based systems will ensure a regulatory framework for
enhancing confidence in ML systems. Because black-box models lack logic and explicit
rules, ML approaches have raised problems in automated decision-making tools and
personal data. As a result, technical solutions built on defined principles
are needed to improve the robustness and explainability of AI/ML systems while also
addressing the difficulties of transparency and repeatability of AI-based solutions.
Transparency and reproducibility in AI are critical for prospectively validating and
applying such technologies and models in clinical practice [21]. To ensure that the
methodology and code that underpin a research publication are adequately documented,

12
several frameworks and reproducible research practices have been established.
Transparency is addressed in terms of common code, software dependencies, and
the parameters required to train a model, allowing the research study to be conducted in a
more transparent manner. Building an ML-based framework for cancer prediction and
classification while dealing with several data modalities, i.e. multimodal frameworks,
provides a new challenge in the field of cancer research. The development of
integrative prediction models by merging the output of many algorithms is a
breakthrough, but it also poses a difficulty in terms of interpretation and dependability
of the models' clinical implications. The lack of consideration devoted to data
size and learner validation was one of the most generally recognised flaws identified
among the investigations reviewed in this study. As it stands, there are a variety of
studies with a disorganised testing strategy. A sufficiently large dataset that can
be partitioned into disjoint training and test sets, or subjected to some reasonable form of
n-fold cross-validation for smaller datasets, is a basic requirement for
any ML technique. The amount of data isn't the only constraint to effective machine
data verification are crucial when working with huge data sets. Careless data entry can
frequently result in simple off-by-one errors, in which all of the values for a specific
variable in a table are pushed up or down by one row. This is why having a second data
input curator or data checker perform independent verification is usually advantageous.
Additional data integrity verification or spot checks by a competent specialist, not
merely a data entry clerk, is also a beneficial exercise.

13
CHAPTER 2

LITERATURE REVIEW

Related Literature Review

The motivation behind this research is the rapid growth in cancer incidence and
mortality cases worldwide. The reasons are complicated, but they include population
ageing and expansion, as well as changes in the prevalence and distribution of cancer's
key risk factors. It depicts the cancer incidence cases and death statistics reported by the
American Cancer Society and other reliable resources.

Fig 2.1 Graphical representation of different cancers [6]

It talks about the current methodology used in the medical sector for cancer prediction:

 Screening

 Chemotherapy

Then it talks about the use of artificial intelligence in the medical field.

AI-based techniques have contributed significantly to the field of cancer research. The
14
research works mentioned in the literature have focussed mainly on deep learning
techniques. Deep learning classifiers have dominated over machine learning models in
the field of cancer research. Among Deep learning models, Convolutional Neural
Networks (CNN) have been used most commonly for cancer prediction; approximately
41% of studies have used CNN to classify cancer. Neural networks (NN) and Deep
Neural Networks (DNN) have also been used extensively in the literature. Ensemble
learning techniques (Random Forest Classifier weighted voting, Gradient Boosting
Machines) and Support vector machines (SVM) are the most commonly utilised in
literature, aside from deep learning approaches.

This graph depicts the distribution of literature based on AI-based prediction models.
Investigation 2: Which cancer site and training data have been explored most
extensively? The majority of the research papers examined in this review focused on
automated cancer prediction and diagnosis. The most extensively explored site is the
breast, followed by the kidney. Most researchers have worked on brain, colorectal,
cervical, and prostate cancer prediction in addition to breast and kidney cancer. It
shows how research works are distributed throughout cancer sites.
The type of data utilised to train the prediction model has a substantial impact on its
performance. The data used to train the classification model has an impact
on the model's dependability and prediction outcomes. Most of the research studies
reviewed in this paper have used Magnetic Resonance Imaging (MRI). The second
most commonly used data is Computed Tomography (CT) scan images.
Other picture types utilised in the literature include dermoscopic, mammographic,
endoscopic, and pathological. It highlights the distribution of papers based on the type
of data used to train the prediction model.
Investigation 3: In which years were most of the cancer prediction studies published?

The research works published between 2009 and April 2021 were selected for this review
article. It demonstrates the distribution of the articles based on the publication year. The
majority of the research papers were published in 2020 (35), 2019 (32), and 2018 (30).
We could only retrieve papers published up until April 2021, hence there are few
papers from 2021.

15
Convolutional Neural Networks models have been used to predict practically every
type of cancer, including brain, colorectal, skin, thyroid, and lungs, due to their
specificity. The majority of studies that looked into breast cancer diagnostic prediction
used hybrid modes or unique methodologies. In addition, neural networks have been
used to practically all datasets related to breast and cervical cancer. Only Convolutional
Neural Networks have been used to study stomach cancer. Support Vector Machines
(SVMs) have been used to forecast liver and breast cancer.
In a nutshell, Convolutional Neural Networks can be applied with different datasets [1].
Conclusion: This review study attempts to summarize the various research directions
for AI-based cancer prediction models. AI has marked its significance in the area of
healthcare, especially cancer prediction. The paper presents a complete evaluation of
the machine and deep learning models employed in cancer early detection employing
medical imaging, as well as a critical and analytical examination of current state
-of-the-art cancer diagnostic and detection analysis methodologies. Machine and deep
learning techniques for extracting and categorising disease features play a vital role in
early cancer prognosis and diagnosis using AI techniques. Most earlier literature works,
according to our findings, used deep learning approaches, particularly Convolutional
Neural Networks.

Future Scope: As AI and ML are rapidly advancing technologies, there is huge scope in
this area. Better algorithms have to be developed to train the model, so that it can learn
fast and efficiently.

Using a machine learning system, this work seeks to solve the challenge of automatic
breast cancer detection. The current method is divided into steps. The breast cancer
dataset was used in three distinct projects. In the first test, we demonstrated that with
effective configuration, the three most popular evolutionary algorithms may attain the
same performance. The second experiment looked at how combining different feature
selection approaches enhances accuracy. Finally, in the last experiment, we discovered
how to create a machine learning supervised classifier automatically. We used the GP
algorithm to address the hyperparameter problem, which is a difficult problem for
machine learning algorithms. From among the many configurations, the proposed
algorithm chose the most appropriate algorithm. The Python library was used in all of
the experiments. Despite the fact that the proposed method produced significant results
16
by analysing an ensemble of approaches using an exhaustive machine learning strategy,
we discovered a significantly higher time consumption rate.

Fig 2.2 ANN [7]

Feature extraction is a key stage in the identification of breast cancer since it aids in the
differentiation of benign and malignant tumours. Image attributes such as smoothness,
coarseness, depth, and regularity are extracted via segmentation after extraction. The
goal of this project is to improve the list of data transformations and machine learning
algorithms that will be used to complete the classification transformation. It's
challenging to find the ideal machine learning algorithm and data combination. Genetic
programming (GP) [22] is presented to optimise the data and control parameters of the
proposed model, as a result of the growth of hyperparameter tuning. To select the best
combination that leads to the best evaluation findings, this well-known evolutionary
technique must be used. The GP produces a predetermined number of pipelines at
random to make up the population members. Each individual (pipeline) in the
population was assessed based on its fitness, which was used as the classification
score in this study. The pipelines are built using
supervised models from the scikit-learn package. For all classifiers besides linear
discriminant analysis, the hyperparameter optimised in this work is the number of

17
kernel functions. The number of kernels is determined at random. The amount of
data isn't the only constraint to effective machine learning. Quality data sets and careful
feature selection are also critical. Data entry and data verification are crucial when
working with huge data sets. Careless data entry can frequently result in simple off-by-
one errors, in which all of the values for a specific variable in a table are pushed up or
down by one row. This is why having a second data input curator or data checker
perform independent verification is usually advantageous.

This helps us save a lot of time as well as money for the patient. Challenges faced by
the researchers in the construction of AI-based prediction models from the above
literature survey: Limited data size. The most common challenge faced by most of the
studies was insufficient data to train the model. A small sample size implies a smaller
training set which does not authenticate the efficiency of the proposed approaches. The
paper presents a complete evaluation of the machine and deep learning models
employed in cancer early detection employing medical imaging, as well as a critical
and analytical examination of current state-of-the-art cancer diagnostic and detection
analysis methodologies [2].

Fig 2.3 Comparison of Classifier Accuracy [8]

Conclusion: In the above paper, the authors apply different algorithms to the datasets to
check their accuracy. The results show that the gradient algorithms are the ones with
higher accuracy than the others. The reason behind this is that the gradient approach uses the

18
combination of all the other algorithms and selects the best one.

Future Scope: The above results show that, for better accuracy, gradient algorithms
should be used. Cross-validation of the results therefore becomes very important
for obtaining an optimized result.

Cruz JA et al. evaluated the predictive power of the models. Each model's predictive power was
verified in at least three methods. To begin, the models' training was evaluated and
monitored using 20-fold cross-validation. To reduce the stochastic element involved
with sample partitioning, a bootstrap resampling method was used, which involved
performing cross-validation 5 times and averaging the results. Second, the feature
selection procedure was repeated 100 times within each fold to reduce bias in feature
selection (i.e. selecting the most informative subset of SNPs) (5 times for each of the 20
folds). The results were then compared to a random permutation test, which had a
prediction accuracy of 50% at best. While the authors sought to reduce the random
element of sample partitioning, a preferable strategy would have been to employ leave-
one-out cross-validation, which would have totally eradicated this stochastic factor.
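
For reference, leave-one-out cross-validation removes that stochastic factor because every possible single-sample test partition is used exactly once. A minimal sketch, assuming scikit-learn (the dataset and classifier here are stand-ins, not those used by Cruz et al.):

# Leave-one-out CV: one sample is held out per fold, so there is no
# random partitioning at all (deterministic, but expensive).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())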

This research is a great example of a well-designed and well-tested machine learning


application. Data for each sample was independently verified for quality assurance and
correctness, and a sufficiently big data set was obtained. To assess the universality of
the machine learning model, blinding sets for validation were available from both the
original data set and an external source.

We sought to explain, compare, and evaluate the performance of several machine


learning algorithms used in cancer prediction and prognosis in this paper. We
discovered several similarities in terms of the sorts of machine learning methods
utilised, the types of training data incorporated, the types of endpoint predictions made,
the types of malignancies analysed, and the overall performance of these systems in
predicting cancer susceptibility or outcomes. While ANNs continue to dominate, it is
clear that a rising number of other machine learning algorithms are being employed,
and that they are being utilised to predict at least three different types of cancer
outcomes. It's also obvious that, as compared to traditional statistical or expert-based
systems, machine learning methods improve the performance or prediction accuracy of
most prognoses. While the majority of studies are well-designed and verified, more

19
attention to experimental design and implementation appears to be necessary,
particularly in terms of the quantity and quality of biological data [3].

Conclusion: The above paper talks about the risk of cancer and how important it
is to detect it at an early stage. The paper presents results on the survival rate when
cancer is detected early, and on how fatal it can be when detected late.

It also focuses on upcoming developments in the field of medical science and the
collection of data.

The BraTS dataset is a collection of brain tumour MRI scans obtained from various
different locations under standard clinical conditions, but using different equipment and
imaging techniques, resulting in significantly different image quality reflecting diverse
clinical practice across institutions. The subsequent tumour annotation technique, on the
other hand, was created to allow for equivalent ground-truth delineations across
different annotators. However, there are non-necrotic, non-cystic patches in high-grade
tumours that do not enhance but may be distinguished from the surrounding vasogenic
oedema and constitute non-enhancing infiltrative tumour. Another issue is the tumour
centre definition in low-grade gliomas. It's tough to tell the difference between tumour
and vasogenic oedema in these circumstances, especially if there's no enhancement.
While individual automated segmentation systems have increased in accuracy, their
robustness remains inferior to expert performance, as measured by inter-rater
agreement. This robustness is predicted to improve as the training set grows in size, as
more diverse patient populations are captured and described, as well as improved
training schemes and ML architectures. Beyond these speculative predictions, our
quantitative analyses reveal that the fusion of segmentation labels from diverse
automated approaches is more resilient than the ground truth inter-rater agreement
(given by clinical professionals) in terms of both accuracy and consistency across
participants. However, the proposed ways to ensemble many models are a viable way to
eliminate outliers and enhance automated segmentation precision. Future research is
critical, according to us, in order to improve the robustness of individual techniques by
improving the ability of segmentation algorithms to manage confounding effects that
are common in images taken through ordinary clinical workflows[4].

Conclusion: The paper examines the BraTS data with respect to feature extraction in
20
the MRI scans, covering segmentation, testing, and survival prediction. Essentially, it checks every
aspect of the data to extract information from it. Every year BraTS collects data through different
processes, and in this paper the authors analyse that data to extract information.

Future Scope: The collection of data is going to be very important: when data is available,
more information can be garnered from it. Storing data is very beneficial and reduces
the work of rediscovering the information every time.

Magnetic resonance imaging (MRI) is one of the most frequent approaches for detecting brain
tumours. It provides crucial data for examining the human body's interior
structure in depth. Because of the variety and complexity of brain tumours, MR image
categorization is a difficult endeavour. Sigma filtering, adaptive thresholding, and
region detection are among the steps in the suggested technique for detecting a brain
tumour in MR images. Major Axis Length, Euler Number, Minor Axis Length,
Solidity, Area, and Circularity are some of the shape features that are taken into
account when extracting features from MR images. The suggested method employs two
supervised classifiers: the C4.5 decision tree algorithm and the Multi-Layer Perceptron (MLP)
algorithm. The classifiers are used to categorize brain cases as normal or abnormal;
abnormal brain cases are divided into one type of benign tumour and five types of
malignant tumours. Using the MLP algorithm and 174 samples of brain MR images,
a maximum precision of roughly 95% is attained [5].

Conclusion: From the above paper we can conclude that the MLP algorithm is better than
the C4.5 decision tree algorithm, as MLP gives better accuracy. The C4.5 decision tree
algorithm is faster but not as precise.

Future Scope: Since MLP is more accurate than the C4.5 decision tree, MLP (or an
algorithm better than MLP) should be preferred.

The authors used the Naïve Bayes Classifier, Support Vector Machine (SVM) Classifier,
Bi-clustering and AdaBoost techniques, R-CNN (Region-based Convolutional Neural Networks)
Classifier, and Bidirectional Recurrent Neural Networks (HA-BiRNN). This section

21
explains these techniques. RFE and SVM are combined in the SVM Classifier method
[6]. RFE is a recursive strategy for selecting dataset features, removing those with the
lowest feature weight. As a result, in each round of SVM-RFE, the least useful features
(lowest-weight features) are removed. The pre-processed images are used to extract entropy,
geometrical, and textural properties. Entropy (E) is a randomness measure that is
used to describe the texture of an input image [11]. Shape characteristics serve a crucial
role in distinguishing between normal and cancerous cells. Each picture is divided into
'n' sub-squares and quantified in textural features.
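
A minimal sketch of the SVM-RFE idea, assuming scikit-learn (the dataset here is a stand-in for the one used by the authors): RFE repeatedly drops the lowest-weight features of a linear SVM until the requested number remain.

# Recursive feature elimination driven by linear-SVM weights.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
selector = RFE(SVC(kernel="linear"), n_features_to_select=10, step=1)
selector.fit(X, y)                                  # drops one feature per round
print("kept feature indices:", selector.get_support(indices=True))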

The authors present the new DNNS breast cancer detection algorithm. Unlike other
methods, the proposed solution is based on a deep neural network's Support value. A
normalising procedure has been used to improve the performance, efficiency, and
quality of photographs. Experiments have shown that the suggested DNNS is far
superior to existing approaches. The suggested method is guaranteed to be favourable
in terms of performance, efficiency, and image quality, all of which are critical in
today's medical systems [6].

The suggested system describes the breast cancer model and depicts the Support Vector
Machine (SVM) algorithm's execution in classifying breast cancer tumours as benign or
malignant. The dimensionality reduction technique is implemented in the data
preparation module. Dimensionality reduction is the process of reducing the number of
independent variables to a small number of key variables by eliminating those that
aren't as important in forecasting the outcome. The data is analysed, and a model is
created to predict whether or not the tumour is malignant. It's a binary classification
problem, and the data is checked for accuracy using a few techniques.
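
As one concrete way to realize this step (the cited work does not necessarily use PCA; this is an illustrative sketch assuming scikit-learn), principal component analysis compresses the feature set into a few key components:

# PCA: project 30 correlated features onto a handful of key components.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_reduced = PCA(n_components=5).fit_transform(StandardScaler().fit_transform(X))
print(X.shape, "->", X_reduced.shape)      # (569, 30) -> (569, 5)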

The dataset retrieved from the UCI machine learning repository is used to train and test
our proposed system. Radius, texture, perimeter, area, smoothness, compactness,
concavity, concave points, symmetry, and fractal dimension are the 10 real-valued
features in the dataset. These properties are calculated from the cell nuclei of the breast
mammography images. The correlation of two variables is the statistical link between them.
With precise and dependable prediction effects, an appropriate breast cancer diagnosis
model can assist scientific practitioners.
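
A short sketch of such a correlation check, assuming pandas and a scikit-learn version that supports as_frame=True:

# Rank features by the strength of their correlation with the diagnosis.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
corr = data.frame.corr()["target"].abs().sort_values(ascending=False)
print(corr.head(6))    # features most strongly related to the target label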

Early identification of breast cancer allows it to be treated, increasing the odds of

22
survival. With 99 percent accuracy, the proposed approach diagnoses tumours as
malignant or benign using attributes extracted from cell pictures. As a result, it can be
utilised as an effective tool for detecting and preventing breast cancer. In this arena, the
integration of multidimensional data with various categorization, feature selection, and
dimensionality reduction approaches can give excellent inference tools. More study
may be done in this area to improve the classification techniques' ability to predict
using different factors.

From this survey we conclude that most of the automatic cancer prediction systems
are based on machine learning concepts, including classification and clustering
algorithms. This paper presented an extensive review of various ML classification
techniques for the prediction of cancer, and standard datasets have been used for a wide
variety of cancers such as brain cancer and breast cancer. A detailed list of results found
by many researchers has been tabulated to solve the problems by various computational
intelligence techniques. The most successful approach is SVM and combinations of
SVM techniques, which gave up to 99% accuracy on a smaller number of training

Fig 2.4 Breast Cancer [9]

datasets, which does not guarantee good prediction in the case of large datasets. However,
options are available for improving the prediction of cancer at an early
stage. There are many datasets available to explore further for the same. There are large
numbers of cancer types available with unknown functions [7].
23
CHAPTER 3

PROBLEM FORMULATION

3.1 Problem Discussion

Over the years, a continuous evolution in cancer research has taken place.
Scientists used various methods, such as early-stage screening, so that they could find
different types of cancer before they could do any damage. With this research, they were
able to develop new strategies to help predict early cancer treatment outcomes. With
the arrival of new technology in the medical field, a huge amount of data related to
cancer has been collected and is available for medical research. But physicians find the
accurate prediction of the cancer outcome the most interesting yet challenging part.

For this reason, machine learning techniques have become popular among researchers.
These tools can help discover and identify patterns and relationships in the
cancer data from huge datasets, while they can effectively predict future outcomes of
a cancer type. Patients have to spend a lot of money on different tests and treatments to
check whether they have breast cancer or not. These tests can take a long time and the
results can be delayed. Also, after confirmation that the patient has cancer, more tests
need to be done to check whether the cancer is benign or malignant. In this project, we
will be using different machine learning techniques to analyze the data given in the
datasets. This analysis, aided by machine learning algorithms, will help us predict
whether the cancer is benign or malignant: benign cancer is cancer that doesn’t spread,
whereas malignant cancer cells spread across the body, making the disease very
dangerous.

This prediction can help doctors prescribe different medical examinations for the
patients based on the cancer type. This helps us save a lot of time as well as money for
the patient. Challenges faced by the researchers in the construction of AI-based
prediction models from the above literature survey: Limited data size. The most
common challenge faced by most of the studies was insufficient data to train the model.
24
A small sample size implies a smaller training set, which does not authenticate the
efficiency of the proposed approaches. A good sample size can train the model better
than a limited one. Multiple dimensions. High dimensionality is another data-related
challenge in cancer research. High dimensionality refers to a large number of features
compared to the number of cases.

Problem of class disparity. The uneven distribution of classes in medical data sets,
particularly cancer data, is a major concern. Class imbalance arises due to a mismatch
of the sample size of each class. Computational time. About 90% of studies have
endorsed deep learning approaches over other techniques to predict cancer using medical
images. Deep learning-based techniques, on the other hand, are quite complicated.
About 41% of the studies have used the CNN classifier, which has performed
significantly but at the cost of high computational time and space. Clinical
application. Although AI-based models have proven their dominance in cancer
research, their practical deployment in clinics has yet to be implemented. These
models need to be validated in a clinical setting to assist the medical practitioner in
affirming the diagnosis verdicts. Model generalizability. A shift in research towards
improving the generalizability of the model is required. The majority of studies
have advocated a single-site validation model for prediction. The models must be
validated on numerous sites in order to increase the model's generalizability.

3.2 Problem Statement

How do we increase the accuracy of cancer cell detection using different algorithms?

Limited data size: The most common challenge faced by most of the studies was
insufficient data to train the model.

Efficient feature selection technique: Many studies have achieved exceptional
prediction outcomes. However, a computationally effective feature selection method
is still required to reduce the data cleaning procedures while generating high
cancer prediction accuracy.

Clinical implementation: AI-based models have proved their dominance in cancer
research; still, the practical implementation of the models in clinics has not been
incorporated. These models need to be validated in a clinical setting to assist the
medical practitioner in affirming diagnosis verdicts. Obtaining a high-dimensional
dataset is also a very big issue.

3.3 Depiction of Problem Statement

Fig 3.1 Data Flow Diagram [9]

3.4 Objectives

Improving cancer detection and localization: Models can classify cancer cells and
locate their position with high accuracy.

Improving speed for real-time detection: Building a model that can perform complex
computation, such as a neural network algorithm and position detection, within a
limited interval with good speed and accuracy.

Developing health infrastructure: Determining significant risk factors from a
medical dataset that may lead to disease, and gaining knowledge about the disease
as early as possible.

CHAPTER 4

PROPOSED WORK

4.1 Introduction

In the proposed system we plan on using existing data of cancer patients, collected
over several years, and running different machine learning algorithms on it. These
algorithms analyze the data from the datasets to predict whether the patient has
cancer and also tell us whether the cancer is malignant or benign.

This is done by taking the patient's data, mapping it against the dataset, and
checking whether any patterns are found. If a patient has breast cancer, then
instead of running more tests to check whether the cancer is malignant or benign,
ML can be used to predict the case based on the huge amount of available breast
cancer data. The proposed system helps patients by reducing the amount of money
they need to spend just for the diagnosis. Also, if the tumor is benign, then it is
not cancerous and the patient does not need to go through any of the other tests,
which also saves a lot of time.

Fig 4.1 Flow Diagram

The description above covers our model, the underlying research, and the processing
techniques. The most important part of any software or application is its user
interface: an interactive and friendly environment should be provided to the user
so that daily tasks, and working with the system throughout the day, become easier.
The fundamental goal
of machine learning approaches is to create a model that can be used for
classification, prediction, estimation, and other tasks. The most common task in
the learning process is classification. This learning function, as previously said,
classifies a data item into one of several predetermined classes. Training and
generalization mistakes can occur when ML techniques are used to create a
classification model. The former refers to misclassification errors on the training
data, while the latter pertains to prediction errors on the testing data. A good
classification model should be able to accurately categorize all of the examples in
the training set. If the test error rates of a model begin to increase even though
the training error rates decrease, the phenomenon of model overfitting occurs. This
scenario is related to model complexity: as the model complexity increases, the
training errors of the model will decrease. The optimum complexity of a model, one
not prone to overfitting, is the one with the least generalization error. The
bias-variance decomposition is a formal method for assessing a learning algorithm's
expected generalization error.

The bias component of a particular learning algorithm measures the systematic error
of that algorithm. Additionally, a second source of error, taken over all possible
training sets of a given size and all possible test sets, is called the variance of
the learning method. A worked form of the decomposition is sketched below.
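As a hedged illustration (a standard result for squared-error loss, not specific to
this project), the expected generalization error at a point x decomposes as:

E[(y - \hat{f}(x))^2] = (E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + \sigma^2

that is, bias^2 + variance + irreducible noise, where \hat{f} is the learned model,
f is the true function, and \sigma^2 is the noise in the labels.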

Datasets: This part of the project covers the research into available datasets that
can be used for the project. The initial dataset on which we planned to work had
certain flaws. CBIS-DDSM dataset: neither the analytics nor the experimental
validation of the data was sufficient to move forward with the project. There was
no exact data regarding the classification of tumors into benign and malignant for
the scan images of patients. One of the most important demerits of this dataset was
the improper organization of data and the huge number of redundant records, which
can cause glitches in the training process.
Pre-processing data: Data preprocessing is a data mining technique used to filter
data into a usable format. Real-world datasets come in many different formats;
because the data is not available in the format we require, it must be fitted into
an intelligible form.

Fig 4.2 Non Cancerous and Cancerous Cell

Encoder Method: Label Encoder is an efficient tool for encoding the levels of
categorical features into numeric values. Label Encoder encodes labels with values
between 0 and n_classes − 1. All our categorical features are encoded: in this
project, the diagnosis is encoded so that benign is 0 and malignant is 1. After
encoding the dataset we applied a neural network to it and measured the accuracy
achieved. A minimal sketch of this step follows.
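A hedged sketch of the encoding step, assuming a Kaggle WDBC-style CSV with a
'diagnosis' column holding 'M'/'B' labels (the file name data.csv is hypothetical):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("data.csv")              # hypothetical file name
le = LabelEncoder()
# LabelEncoder assigns integers in sorted label order: 'B' -> 0, 'M' -> 1
df["diagnosis"] = le.fit_transform(df["diagnosis"])
print(le.classes_)                        # ['B' 'M']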

The dataset contains these features:

 radius

 texture

 perimeter

 area

 smoothness

 compactness

 concavity

 concave points

 symmetry

 fractal dimension
Table 4.1 Measuring Factors Table

4.2 Proposed Methodology

Step 1: Collection of the dataset.

Step 2: Understanding features of the dataset.

Step 3: Pre-processing the data.

Step 4: Split data into a training dataset and testing dataset.

Step 5: Apply ML algorithms to the dataset to predict cancer.

Step 6: Improving results by tuning and comparing the algorithms. A hedged
end-to-end sketch of steps 3–5 follows.
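A minimal sketch of steps 3–5 under the same assumptions as above (a 'diagnosis'
label column and numeric feature columns; any id or empty columns in the raw CSV
would need dropping first):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("data.csv")                       # hypothetical path
X = df.drop(columns=["diagnosis"])                 # feature matrix
y = (df["diagnosis"] == "M").astype(int)           # malignant = 1, benign = 0

# Step 4: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Step 3 (scaling) and Step 5 (fit and predict)
scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="linear").fit(scaler.transform(X_train), y_train)
pred = clf.predict(scaler.transform(X_test))
print("accuracy:", accuracy_score(y_test, pred))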

4.3 Algorithms

1. K-NEAREST NEIGHBORS (KNN)

This algorithm is one of the simplest machine learning techniques. It is a lazy
learning algorithm used for regression and classification. It classifies objects
using their "k" nearest neighbors; k-NN only considers the neighbors around the
object, not the underlying data distribution. If k = 1, it assigns the unknown
object to the class of its nearest neighbor. If k > 1, the classification is
decided by majority vote based on the k nearest neighbors' classes.

An effective strategy for both classification and regression is to weight the
contributions of the neighbors, so that closer neighbors contribute more to the
average than farther ones. For example, a common weighting scheme gives each
neighbor a weight of 1/d, where d is the distance to the neighbor. When KNN is used
for classification, the output is the class with the highest frequency among the K
most similar instances. In essence, each instance votes for its class, and the
class with the most votes is taken as the prediction. If you have an even number of
classes (e.g. 2), it is a good idea to choose an odd value of K to avoid a tie;
inversely, use an even K when you have an odd number of classes. Expanding K by one
and looking at the class of the next most similar instance in the training dataset
can also reliably break ties.

Here are a few things to remember:

Our forecasts become less stable as we reduce the value of K towards one.
Inversely, as we increase the value of K, our predictions become more stable due to
majority voting/averaging, and are thus more likely to be accurate (up to a certain
point). Eventually, we begin to witness an increasing number of errors; at this
point we know we have pushed the value of K too far. In cases where we take a
majority vote (e.g. picking the mode in a classification problem) among labels, we
usually make K an odd number to have a tiebreaker.

Advantages:

 The algorithm is straightforward and simple to implement.

 There is no need to build a model, tune several parameters, or make additional
assumptions.

 The algorithm is versatile: it can be used for classification, regression, and
search.

Disadvantages:

The algorithm gets significantly slower as the number of examples and predictor
variables increases. A minimal usage sketch is shown below.
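A hedged KNN sketch in scikit-learn, reusing the split data from the earlier
sketch; the 1/d weighting described above corresponds to weights="distance":

from sklearn.neighbors import KNeighborsClassifier

# Odd k avoids ties in a two-class problem; distance weighting = 1/d
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)          # lazy learner: this just stores the data
print(knn.score(X_test, y_test))   # mean accuracy on the test set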

Fig 4.3 Pictorial representation of K-Nearest Algorithm

NAIVE BAYES

The naive Bayes classifier is based on Bayes' theorem and is one of the oldest
approaches to classification problems. The formula is:

P(A|B) = P(B|A) · P(A) / P(B)

The goal is to figure out how likely A is to occur given that B has occurred. The
naive Bayes classifier combines Bayes' model with decision rules, such as choosing
the hypothesis that is the most probable outcome. The "naive" assumption of
conditional independence between every pair of features, given the value of the
class variable, underpins this collection of supervised learning algorithms. It was
created for text categorization problems and is still used as a benchmark today.

Pros:

Predicting the class of a test data set is simple and quick. It is also good at
multi-class prediction.

When the assumption of independence holds, a naive Bayes classifier performs better
compared to other models like logistic regression, and it needs less training data.

Real-time prediction: naive Bayes is an eager learning classifier and it is super
fast; thus, it can be used for making predictions in real time.

Multi-class prediction: this algorithm is also well known for its multi-class
prediction feature; we can predict the probability of multiple classes of the
target variable.

Text classification, spam filtering, and sentiment analysis: naive Bayes
classifiers have a higher success rate than other algorithms in text classification
(owing to better outcomes in multi-class situations and the independence
criterion). As a result, they are commonly utilised in spam filtering and in
sentiment analysis (in social media analysis, to identify positive and negative
customer sentiments).

It performs well with categorical input variables compared to numerical variables;
for numerical variables, a normal distribution (bell curve) is assumed, which is a
strong assumption. A hedged usage sketch follows.
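A minimal naive Bayes sketch, again reusing the earlier split; GaussianNB encodes
the normal-distribution assumption for numeric features noted above:

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))   # per-class probabilities for 3 samples
print(nb.score(X_test, y_test))       # test accuracy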

2. STOCHASTIC GRADIENT DESCENT

Stochastic gradient descent is a prominent and widely used technique in machine
learning algorithms, and it is the foundation of neural networks. Gradient descent
is an iterative process that starts at a random point on a function and gradually
descends its slope until it reaches the function's lowest point. This algorithm
comes in handy when the optimal locations cannot be determined by simply equating
the function's slope to 0. The main concept is to start with a random point (in our
parabola example, a random "x") and figure out how to update it on each iteration
so that we descend the slope.

The algorithm's steps are as follows:

Determine the objective function's slope with respect to each parameter/feature; in
other words, find the function's gradient.

For the parameters, choose a random starting value (in the parabolic example, a
random "x"; if there are more features such as x1, x2, and so on, we compute the
partial derivative of "y" with respect to each one).

Fig 4.4 SGD Graph

Fill in the parameter values to update the gradient function.

For each feature: step size = gradient × learning rate.

The new parameters are calculated as: new params = old params − step size.

Repeat steps 3–5 until the gradient is almost zero. A toy sketch of this loop
follows.
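A hedged NumPy sketch of these steps for the parabola example y = x**2, whose
gradient is 2x; the learning rate and iteration count are illustrative:

import numpy as np

x = np.random.uniform(-10, 10)       # random starting point
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * x                 # slope of y = x**2 at the current x
    step_size = learning_rate * gradient
    x = x - step_size                # new params = old params - step size
    if abs(gradient) < 1e-6:         # stop once the gradient is almost zero
        break
print(x)                             # converges towards the minimum at x = 0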

There are three different types of gradient descent:

 Batch Gradient Descent

 Stochastic Gradient Descent

 Mini-batch Gradient Descent

Stochastic Gradient Descent's Benefits

 Because the network processes only one training sample at a time, it is easy to
fit in memory.

 Because only one sample is processed at a time, it is computationally efficient.

 It can converge faster on larger datasets because the parameters are updated more
frequently.

 The steps taken towards the loss function minima oscillate as a result of the
frequent updates, which can help escape local minima (in case the computed position
turns out to be a local minimum).

3. SVM

SVM stands for Support Vector Machine and is one of the most widely used
Supervised Learning algorithms for Classification and Regression issues. However, it is
mostly used in Machine Learning to solve classification problems.

The goal of the SVM method is to discover the best line or decision boundary for
dividing n-dimensional space into classes so that subsequent data points can be
easily placed in the right category. This ideal decision boundary is known as a
hyperplane. SVM selects the extreme points/vectors that help build the hyperplane:
the calculation produces a hyperplane that divides the data into two classes with
the maximum margin, meaning that the distance between the hyperplane and the
nearest examples (the margin) is maximised. The generated classifier has a high
degree of generalizability and can reliably classify fresh samples; it is worth
mentioning that SVMs can also produce probabilistic outcomes, and can perform
non-linear classification tasks by using a non-linear kernel. The algorithm is
known as a Support Vector Machine because the support vectors are these extreme
examples. Consider the picture below, which shows how two distinct categories are
identified using a decision boundary.

There are two types of SVM:

 Linear SVM: Linear SVM is a classifier for linearly separable data, which means
that the dataset can be classified into two classes using a single straight line;
the classifier is termed Linear SVM.

 Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
implies that if a dataset cannot be classified using a straight line, it is
non-linear data, and the classifier employed is called Non-linear SVM.

Fig 4.5 SVM Linear

There can be various lines/decision boundaries to separate the classes in
n-dimensional space, but we must choose the best decision boundary to help classify
the data points. This best boundary is referred to as the hyperplane of SVM.

As a result, the SVM approach assists in determining the optimal line or decision
boundary, also known as a hyperplane. The data points of both classes that lie
closest to the hyperplane are referred to as support vectors. The distance between
these vectors and the hyperplane is called the margin, and SVM's goal is to
maximise this margin as much as feasible: the hyperplane with the biggest margin is
the ideal hyperplane.

Kernel Functions:

To begin implementing kernel functions, install the "scikit-learn" library from the
command prompt terminal.

Gaussian Kernel: used to transform the data when no prior knowledge of the data
exists.

Gaussian Kernel Radial Basis Function (RBF): the same as the preceding kernel
function, but with the addition of the radial basis approach to improve the
transformation.

Sigmoid Kernel: this function is equivalent to a two-layer perceptron model of a
neural network and is used as an artificial neuron activation function.

Polynomial Kernel: a polynomial kernel captures the similarity of vectors in the
training data in a feature space over polynomials of the original variables used in
the kernel. A hedged comparison sketch follows.
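A minimal comparison of these kernels using scikit-learn's SVC, reusing the scaled
split from the earlier sketch; hyperparameters are left at their defaults rather
than tuned for this dataset:

from sklearn.svm import SVC

for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    clf = SVC(kernel=kernel)
    clf.fit(scaler.transform(X_train), y_train)
    print(kernel, clf.score(scaler.transform(X_test), y_test))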

Fig 4.6 SVM kernel

CHAPTER 5

SYSTEM DESIGN

The previous chapter discussed the challenges faced during data analysis and their
potential solutions. This chapter discusses the process followed to design the
system. It defines the view of the project at the levels of architecture, modules,
data and the user.

5.1 Functional Specification of System

The system pipeline proceeds as: Cancer dataset → Pre-processing → Feature
Selection → Data Partition → Classification → Cancerous / Non-Cancerous.

Fig 5.1 Functional Specification of System

5.2 Structural and Dynamic Modeling of System

5.2.1 Use Case Diagram

Fig 5.2 Use Case Diagram

A use case diagram basically describes the high-level functions and scope of a
system. It also identifies the interactions between the system and its actors. The
use cases and actors in the diagram describe what the system does and how the
actors use it, but not how the system operates internally.

A use case diagram is used to represent the dynamic behavior of a system. It
encapsulates the system's functionality by incorporating use cases, actors and
their relationships, and models the tasks, services and functions required by a
system or application.

5.2.2 State Chart / Activity Diagram

Fig 5.3 Activity Diagram

An activity diagram is an important diagram in UML for describing the dynamic
aspects of a system. It is basically a flowchart representing the flow from one
activity to another, where an activity can be described as an operation of the
system.

5.2.3 Component /Deployment Diagram

Fig 5.4 Component Diagram

A component diagram represents the physical components of a system, or it can be
defined as the organization of components inside a system. It is also used to
visualize the static implementation view of a system.

CHAPTER 6

IMPLEMENTATION

6.1 Experimental Setup

6.1.1 Algorithms/techniques used

SVM: SVM stands for Support Vector Machine and is one of the most widely used
Supervised Learning algorithms for Classification and Regression issues. However, it is
mostly used in Machine Learning to solve classification problems.

The goal of the SVM method is to discover the best line or decision boundary for
categorising n-dimensional space into classes so that subsequent data points can be
easily placed in the right category. The ideal choice boundary is known as a
hyperplane. SVM is used to select the extreme points/vectors that help build the
hyperplane. The algorithm is known as a Support Vector Machine, and support vectors
are the extreme examples. Consider the picture below, which shows how two distinct
categories are identified using a decision boundary.

KNN: When KNN is used for classification, the output is the class with the highest
frequency among the K most similar instances. In essence, each instance votes for
its class, and the class with the most votes is deemed the winner.

If you are working with K and an even number of classes (e.g. 2), it is a good idea
to choose an odd value of K to avoid a tie; inversely, use an even K when you have
an odd number of classes.

SGD: SGD (Stochastic Gradient Descent) is a fast and simple method for fitting
linear classifiers and regressors to convex loss functions, such as (linear)
Support Vector Machines and Logistic Regression. Despite the fact that SGD has been
present in the machine learning field for a long time, it has only lately received
a lot of attention in the context of large-scale learning. SGD has been used to
solve the large-scale, sparse machine learning problems that are common in text
categorization and natural language processing. Because the data is sparse, the
classifiers in this module can easily scale to problems with more than 10^5
training samples and features. A hedged usage sketch follows.
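A minimal SGDClassifier sketch; with loss="hinge" it trains a linear SVM by
stochastic gradient descent (feature scaling matters for SGD, hence the pipeline;
X_train/y_train are from the earlier sketch):

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sgd = make_pipeline(StandardScaler(),
                    SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3))
sgd.fit(X_train, y_train)
print(sgd.score(X_test, y_test))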
6.1.2 Software tools used

We have used various technologies which includes:

Python 3.8+ (from the Anaconda Distribution)

Anaconda is a distribution of the Python and R programming languages for scientific


computing, that aims to simplify package management and deployment. Data-science
packages for Windows, Linux, and macOS are included in the release.

Anaconda is useful because it comes with Python and roughly 200 additional Python
packages, all free to use. Many of the most common Python packages for solving
problems are included. When you download Python from Python.org, only Python and
The Standard Library are available; you could then install the extra modules
yourself, but Anaconda saves those steps by bundling everything in a single
download [18].

Additional Libraries

Scikit-learn provides a standard Python interface for a variety of supervised and


unsupervised learning techniques. It is distributed under several Linux distributions and
is licenced under a permissive simplified BSD licence, encouraging academic and
commercial use. The library focuses on data modelling. It is not designed to load,
manipulate, or summarise data. Refer to NumPy and Pandas for these functionalities. In
Python, Scikit-learn (Sklearn) is the most usable and robust machine learning library. It
uses a Python consistency interface to deliver a set of efficient machine learning and
statistical modelling capabilities, such as classification, regression, clustering, and
dimensionality reduction. NumPy, SciPy, and Matplotlib are the foundations of this

Python-based toolkit.

Installation on windows:

pip install scikit-learn

Tensor Flow

TensorFlow is a free and open-source software library for machine learning and
artificial intelligence. It can be used across a range of tasks but has a particular focus on
training and inference of deep neural networks.

Installation on windows:

pip install tensorflow

Matplotlib:

Matplotlib is a free software plotting library for the Python programming language.
It is used to create static, animated, and interactive visualizations, and in this
project it is used to draw the ROC curves and confusion matrix plots.

Numpy:

NumPy is a Python package for working with arrays. It also provides matrices,
Fourier transforms, and linear algebra functions.

NumPy was created by Travis Oliphant in 2005. It is an open-source project, so you
are free to use it. Unlike lists, NumPy arrays are kept in a single continuous area
in memory, making it easy for programs to access and manipulate them. This is known
as locality of reference in computer science, and it is the main reason why NumPy
performs better than lists. NumPy has also been updated to support the most recent
CPU architectures.

Seaborn:

Seaborn is a matplotlib-based Python data visualisation package. It has a high-level


interface for creating visually appealing and instructive statistics visuals. Its charting
functions work with dataframes and arrays containing entire datasets, performing the
necessary semantic mapping and statistical aggregation internally to generate useful
graphs. Its dataset-oriented, declarative API allows you to concentrate on the meaning
of your charts rather than the mechanics of drawing them.

6.2 Dataset Description

6.2.1 Source of Dataset

The source of the dataset is Kaggle.

Unzip the compressed data files and store them in the format mentioned below. The
helper function download_files() present in data_utils.py can be used to do this in
your current working directory automatically (after the compressed files have been
successfully extracted, the function deletes them).

Fig 6.1 Dataset

import data_utils

data_utils.download_files()

The following file is corrupted and gives an error when being loaded. Delete it
before proceeding.

6.2.2 Size (No. of Samples) and Description of Attributes

For each cell nucleus, ten real-valued characteristics are computed:

a) radius (average distance between centre and perimeter points)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter² / area − 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" − 1)

Fig 6.2 Features of dataset

The mean, standard error, and "worst" (mean of the three largest values) of each
characteristic were calculated per image, yielding 30 features. For example, field
3 represents Mean Radius, field 13 represents Radius SE, and field 23 represents
Worst Radius.

The output of SGD code also depicts a couple of graphical results:

i) ROC (Receiver Operating Characteristic) curve: a graphical way of showing
the connection between clinical sensitivity and specificity for every possible
cut-off for a test or a combination of tests.

ii) Confusion Matrix: a matrix used to describe the performance of a
classification model on a set of test data for which the true values are known.
It is an n×n matrix, where n represents the number of target classes.

Fig 6.3 The output specifications obtained on compiling the SGD code

Fig 6.4 The output specifications obtained on compiling the SVM (Linear) code

The output of SVM (linear) also depicts two graphs, one of the confusion matrix and
the other of the ROC curve. The accuracy of SVM (linear) is comparatively higher
than that of SGD.

Fig 6.5 The output specifications obtained on compiling the SVM (Gaussian) code

The output of SVM (Gaussian) also depicts two graphs, one of the confusion matrix
and the other of the ROC curve. The accuracy of SVM (Gaussian) is comparatively
higher than that of SGD and SVM (linear).

Fig 6.6 The dataset representation of Breast Cancer

The above figure 6.6 represents the Breast Cancer dataset and its classification
into malignant and benign. 'M' represents malignant and 'B' represents benign;
these are the labels in the dataset before encoding. After encoding, the labels are
represented in binary form: 0 (zero) in figure 6.6 represents 'B', benign, and 1
(one) represents 'M', malignant.
CHAPTER 7

RESULT ANALYSIS

7.1 Performance Measures

Fig 7.1 Confusion matrix for dataset

Metrics:

Once the model has been trained on the training data, its performance is evaluated
using the test data.

The following metrics will be used:

 Accuracy: used for evaluating the performance of the model on the test data.

 Confusion Matrix: used to compare the model with the benchmark model.

A classification model's performance is described using a confusion matrix. A
hedged metrics sketch follows.
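A minimal sketch of computing both metrics with scikit-learn, where y_test and pred
come from any of the fitted classifiers in the earlier sketches:

from sklearn.metrics import confusion_matrix, accuracy_score

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))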

Results for Stochastic Gradient Descent:

Fig 7.2 ROC of SGD

The receiver operating characteristic curve, or ROC curve, is a graph showing the
performance of a classification model at all classification thresholds. The
accuracy obtained for the SGD algorithm is 94.15%, its precision is 97%, and the
area under the curve is approximately 97%, which means it can be considered a
potential algorithm for disease prediction. A hedged ROC sketch follows.
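A hedged sketch of plotting such a curve, assuming a fitted classifier exposing
decision_function (such as the SGD or SVM models above) and the scaler from the
earlier sketch:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

scores = clf.decision_function(scaler.transform(X_test))
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr, label="AUC = %.2f" % auc(fpr, tpr))
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()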

Fig 7.3 Confusion matrix for SGD

A confusion matrix is a table used to define the performance of a classification
algorithm. The confusion matrix of the SGD algorithm is inclined towards the
positive prediction side, thereby making SGD a good algorithm for classification
and prediction.

Results for SVM Linear:

Fig 7.4 ROC of SVM (linear)

The accuracy obtained for SVM (linear) is 96.49% and the area under its ROC curve
is approximately 97%. SVM with a linear kernel is considered a better algorithm
than SGD.

Fig 7.5 Confusion matrix for SVM (linear)

The confusion matrix of the SVM algorithm with a linear kernel is inclined mostly
towards the positive prediction side, thereby making SVM with a linear kernel a
great algorithm for classification and prediction problems.

Results for SVM Gaussian:

Fig 7.6 ROC of SVM (Gaussian)

The accuracy obtained for SVM (Gaussian) is 97.66% and the area under its ROC curve
is approximately 98%. SVM with a Gaussian kernel is considered a better algorithm
than both SGD and SVM with a linear kernel.

Fig 7.7 Confusion matrix for SVM (Gaussian)

The confusion matrix of the SVM algorithm with a Gaussian kernel is inclined mostly
towards the positive prediction side, thereby making SVM with a Gaussian kernel a
great algorithm for classification and prediction problems.

7.2 Performance Analysis

The accuracy for the Stochastic Gradient Descent classifier is 94.15%.

The accuracy for SVM (linear) is 96.49%.

The accuracy for SVM (Gaussian) is 97.66%.

RESULT: The accuracy of SVM (Gaussian) is the highest.

CHAPTER 8

CONCLUSION, LIMITATION AND FUTURE SCOPE

8.1 Conclusion

This research makes it easier to analyze cancer cell detection in the least
time-consuming, most efficient, and most cost-saving way. The model also shows how
to avoid complex calculations and find the most efficient way to analyze cancer
detection.

8.2 Limitation

The model is able to classify cancerous and non-cancerous cells, but its accuracy
could be improved if more data were fed into it. Preprocessing of the data may
demand high system requirements and may result in memory errors, so a
better-equipped system is preferred. We have not optimized the model by
cross-validation, which may also help improve the detection; a hedged sketch of
that step follows.
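A minimal sketch of the cross-validation left as future work, reusing clf, X and y
from the earlier sketches:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; for a cleaner estimate, wrap scaling and the
# model together in a Pipeline so each fold is scaled independently
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean(), scores.std())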

8.3 Future Scope

In hospitals, an app-based user interface could allow clinicians to quickly assess
the impact of a tumour and make treatment recommendations. Because the performance
and complexity of ConvNets depend on the input data representation, we can try to
forecast the location and stage of the tumour from volume-based 3D images.
Training, planning, and computer guidance during surgery can all be improved by
building three-dimensional (3D) anatomical models for specific patients.

Because the algorithms are still evolving, there is a probability that cancer cell
detection will improve at an earlier stage. Because the number of cancer patients
is increasing every day, cancer cell identification has a lot of potential.

REFERENCES
[1] Kumar, Y., Gupta, S., Singla, R. et al. A Systematic Review of Artificial
Intelligence Techniques in Cancer Prediction and Diagnosis. Arch Computational
Methods Eng (2021).

[2] Habib Dhahri, Eslam Al Maghayreh, Awais Mahmood, Wail Elkilani, Mohammed
Faisal Nagi, "Automated Breast Cancer Diagnosis Based on Machine Learning
Algorithms", Journal of Healthcare Engineering, vol. 2019, Article ID 4253641,
11 pages, 2019.

[3] Cruz JA, Wishart DS. Applications of Machine Learning in Cancer Prediction and
Prognosis. Cancer Informatics. January 2006. doi:10.1177/117693510600200030

[4] S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, et al., "Identifying


the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression
Assessment, and Overall Survival Prediction in the BRATS Challenge", arXiv preprint
arXiv:1811.02629 (2018)

[5] Dena Nadir George, Hashem B. Jehlol and Anwer Subhi Abdulhussein Oleiwi,
"Brain Tumor Detection Using Shape Features and Machine Learning Algorithms", 2016.

[6] Anji Reddy Vaka, Badal Soni, Sudheer Reddy K., Breast cancer detection by
leveraging Machine Learning, ICT Express, Volume 6, Issue 4, 2020, Pages 320-324,
ISSN 2405-9595,

[7] Deepika S and Kapilaa Ramanathan Devi N (2021) Prediction of Breast Cancer
Using SVM Algorithm ISSN 0973-4562 Volume 16, Number 4 (2021) pp. 316-320

[8] Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and
prognosis. Cancer Inform. 2007;2:59-77. Published 2007 Feb 11.

[9] Cruz, Joseph A, and David S Wishart. “Applications of machine learning in cancer
prediction and prognosis.” Cancer informatics vol. 2 59-77. 11 Feb. 2007

[10] Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and
prognosis. Cancer Inform. 2007 Feb 11;2:59-77. PMID: 19458758; PMCID:
PMC2675494.

[11] Barnali Sahu, Debahuti Mishra, "A Novel Feature Selection Algorithm using
Particle Swarm Optimization for Cancer Microarray Data", International Conference
on Modeling Optimization and Computing (ICMOC-2012), Elsevier Procedia Engineering
38 (2012), pp. 27–31.

[12] S. Mishra, C. D. Kaddi, and M. D. Wang, "Pan-cancer analysis for studying
cancer stage using protein and gene expression data", in 38th Annual International
Conference of the IEEE Engineering in Medicine and Biology Society (EMBC),
Orlando, FL, USA, 2016, pp. 2440–2443.

[13] Cuong Nguyen, Yong Wang, Ha Nam Nguyen, "Random forest classifier combined
with feature selection for breast cancer diagnosis and prognostic", J. Biomedical
Science and Engineering, 2013, 6, pp. 551–560.

[14] Ammu P K, Preeja V, "Review on Feature Selection Techniques of DNA Microarray
Data", International Journal of Computer Applications (0975–8887), Volume 61,
No. 12, January 2013, pp. 39–44.

[15] B. M. Gayathri, C. P. Sumathi, T. Santhanam, "Breast cancer diagnosis using
machine learning algorithms – a survey", International Journal of Distributed and
Parallel Systems (IJDPS), Vol. 4, No. 3, May 2013, pp. 105–112.

[16] Ahmad LG, Eshlaghy AT, Poorebrahimi A, Ebrahimi M, Razavi AR, "Using Three
Machine Learning Techniques for Predicting Breast Cancer Recurrence", Open Access,
Journal of Health & Medical Informatics, 2013, vol. 4, issue 2, ISSN: 2157-7420,
http://dx.doi.org/10.4172/2157-7420.1000124

[17] P. Ramachandran, N. Girija, T. Bhuvaneswari, "Early Detection and Prevention
of Cancer using Data Mining Techniques", International Journal of Computer
Applications (0975–8887), Volume 97, No. 13, July 2014, pp. 48–53.

[18] Mehdi Pirooznia, Jack Y Yang, Mary Qu Yang, Youping Deng, "A comparative study
of different machine learning methods on microarray gene expression data", BMC
Genomics, BioMed Central, 2008; International Conference on Bioinformatics &
Computational Biology (BIOCOMP'07), Las Vegas, NV, USA, 25–28 June 2007,
DOI: 10.1186/1471-2164-9-S1-S13.

[19] Arunanand T A, Abdul Nazeer K A, Mathew J P, Meeta Pradhant, "A
Nature-inspired Hybrid Fuzzy C-means Algorithm for Better Clustering of Biological
Data Sets", IEEE International Conference on Data Science & Engineering
(ICDSE '14), pp. 76–82.

[20] Vikas Chaurasia, Saurabh Pal, "Data Mining Techniques: To Predict and Resolve
Breast Cancer Survivability", International Journal of Computer Science and Mobile
Computing, Vol. 3, Issue 1, January 2014, pp. 10–22, ISSN 2320–088X.

[21] P. Yasodha, N. R. Anathanarayanan, "Analysing Big Data to Build Knowledge
Based System for Early Detection of Ovarian Cancer", Indian Journal of Science and
Technology (IJST), Vol 8(14), July 2015, ISSN 0974-5645,
DOI: 10.17485/ijst/2015/v8i14/65745.

[22] Ammu P K, Siva Kumar K C, Sathish M, "A BBO Based Feature Selection Method for
DNA Microarray", International Journal of Research Studies in Biosciences (IJRSB),
Volume 3, Issue 1, January 2015, pp. 201–204, ISSN 2349-0365.
[23] K. Sivakami, "Mining Big Data: Breast Cancer Prediction using DT-SVM Hybrid
Model", International Journal of Scientific Engineering and Applied Science
(IJSEAS), Volume 1, Issue 5, August 2015, pp. 418–429, ISSN: 2395-3470.

[24] Kar, Subhajit, Kaushik Das Sharma, and Madhubanti Maitra, "A particle swarm
optimization based gene identification technique for classification of cancer
subgroups", in 2nd IEEE International Conference on Control, Instrumentation,
Energy and Communication (CIEC), 2016.

[25] Hiba Asria, Hajar Mousannifb, Hassan Moatassimec, Thomas Noeld, "Using Machine
Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis", Elsevier 6th
International Symposium on Frontiers in Ambient and Mobile Systems (FAMS 2016),
Procedia Computer Science 83 (2016), pp. 1064–1069.

LIST OF PUBLICATIONS

[I] Akash Pandey, Ankit Kumar Singh and Nitin Kumar Yadav (2022), "Cancer Cell
Detection using Machine Learning" [accepted in] 4th IEEE ICAC3N-22.

CONTRIBUTION OF PROJECT

1. Objective and Relevance of Project

Cancer cell detection is one of the most important and practical problems in
applications that implement pattern recognition of cancerous cells. The system
recognizes user-defined infected cells and classifies them against pre-defined
datasets, with key applications such as disease prediction and diagnosis of
cancerous cells.

Because it helps one understand widely used machine learning algorithms, rapid
adoption of this technology can help solve many complex real-world problems in real
time. So the objective of this project is to create a cancer cell detection model
that is accurate enough to predict benign cancer cells in real time, efficiently,
and with accuracy and robustness.

2. Expected Outcome

The expected outcome of this project is a model application built using the
standard Tkinter Python library. The application should use the training dataset to
classify benign and malignant cancer cells in real time, with several functions
bound to various user events.

The proposed machine learning model uses a publicly available breast cancer
dataset, and the data is preprocessed before being fed into the model to provide
better accuracy.

3. Concerns Related to Project

3.1. Social Relevance

Cancer cell detection is one of the significant areas of research and development,
with a growing number of possibilities that could be attained.

A cancer cell detection system using machine learning is socially relevant because
it has numerous applications such as disease prediction and diagnosis of cancerous
cells. This helps society by providing the facility of cancer prediction in real
time for various

specific applications. On the other hand, implementing this technology on a large scale
could bring about several repercussions.

3.2. Health Concern

Cancer cell detection using machine learning does not necessarily pose any health
concerns. The basic aim of this project is to formulate and use a machine learning
algorithm to automate the process of cancer prediction. There is no direct impact
or health concern related to this project. It may have an indirect positive impact
on a person's mental health, since using this model provides a sense of
satisfaction and may improve one's efficiency, ultimately saving time and manpower.

3.3. Legal Aspects

Here, our cancer cell detection project uses a freely available public dataset. It
uses Python as its core programming language, which is a free and open-source
language, and the libraries used for this project (TensorFlow, Keras, Tkinter) are
also free and open source. Hence, this project is completely legal, and a risk
management plan will be developed to tackle all the obstacles faced during the
completion of this project. Also, several laws may need to be implemented by the
government, alongside spreading awareness, before introducing cancer cell detection
worldwide, as it might be misused by malicious people to commit felonies.
