Final Project Record Example
ON
DIAGNOSIS OF BREAST CANCER DISEASE USING SUPPORT VECTOR MACHINE
Submitted in partial fulfilment of the requirements for the award of degree in
MASTER OF SCIENCE (COMPUTER SCIENCE)
SUBMITTED BY
S.VARALAKSHMI
Regd No: 18MCS024
CERTIFICATE
This is to certify that this work entitled “DIAGNOSIS OF BREAST CANCER
DISEASE USING SUPPORT VECTOR MACHINE” is a bonafide work carried out by
S.VARALAKSHMI (7724) in partial fulfilment for the award of the degree of
MASTER OF SCIENCE (COMPUTER SCIENCE) of KRISHNA UNIVERSITY,
MACHILIPATNAM during the academic year 2018-2020. It is certified that the
corrections/suggestions indicated for internal assessment have been incorporated in the
report. The project work has been approved as it satisfies the academic requirements in
respect of the project work prescribed for the above degree.
External Examiner
ACKNOWLEDGMENT
The satisfaction that accompanies the successful completion of any task would be incomplete
without mentioning the people who made it possible and whose constant guidance and
encouragement crown all the efforts with success. This acknowledgement transcends the reality
of formality when I express deep gratitude and respect to all those people behind
the screen who guided, inspired and helped me in the completion of this work. I wish to place on
record my deep sense of gratitude to my project guide, Mr. PAVAN KUMAR, Assistant
Professor, Department of MCA, for his constant motivation and valuable help throughout the
project work.
My sincere thanks to Mrs. SHAMIM, Head of the Department of M.Sc. (CS), for her guidance
regarding the project. I also extend my thanks to Dr. P. BHARATHI DEVI, Head of the
Department of MCA, for her valuable help throughout the project. I also extend my thanks to
Dr. MAZHARUNNISA BEGUM of the P.G. CENTRE, and I extend my gratitude to SRI
S.VENKATESH, DIRECTOR of P.G. COURSES, for his valuable suggestions.
S.VARALAKSHMI
Regd.NO:18MCS024
DECLARATION
I hereby declare that the project work entitled “DIAGNOSIS OF BREAST CANCER DISEASE
USING SUPPORT VECTOR MACHINE”, submitted to K.B.N P.G COLLEGE affiliated to
KRISHNA UNIVERSITY, has been done under the guidance of Mr. PAVAN KUMAR,
Assistant Professor, Department of MCA, during the period of my study, and that it has not
formed the basis for the award of any degree/diploma or other similar title to any candidate of
any University.
DATE:
PLACE: VIJAYAWADA
ABSTRACT
Breast cancer is an all too common disease in women, making how to effectively predict it an
active research problem. A number of statistical and machine learning techniques have been
employed to develop various breast cancer prediction models. Among them, support vector
machines (SVM) have been shown to outperform many related techniques. To construct the
SVM classifier, it is first necessary to decide the kernel function, and different kernel functions
can result in different prediction performance. However, there have been very few studies
focused on examining the prediction performances of SVM based on different kernel functions.
Moreover, it is unknown whether SVM classifier ensembles which have been proposed to
improve the performance of single classifiers can outperform single SVM classifiers in terms of
breast cancer prediction. Therefore, the aim of this project is to fully assess the prediction
performance of SVM and SVM ensembles over small and large scale breast cancer datasets. The
classification accuracy, ROC, F-measure, and computational times of training SVM and SVM
ensembles are compared.
Breast cancer prediction has long been regarded as an important research problem in the medical
and healthcare communities. This cancer develops in the breast tissue. There are several risk
factors for breast cancer including female sex, obesity, lack of physical exercise, drinking
alcohol, hormone replacement therapy during menopause, ionizing radiation, early age at first
menstruation, having children late or not at all, and older age.
There are different types of breast cancer, with different stages or spread, aggressiveness, and
genetic makeup. Therefore, it would be very useful to have a system that would allow early
detection and prevention which would increase the survival rates for breast cancer.
The literature discusses a number of different statistical and machine learning techniques that
have been applied to develop breast cancer prediction models, such as logistic regression, linear
discriminant analysis, naïve Bayes, decision trees, artificial neural networks, k-nearest neighbor,
and support vector machine methods. More specifically, studies comparing some of the above-
mentioned techniques have shown that SVM performs better than many of the other related
techniques.
INDEX
6. RESULT ANALYSIS 77 – 81
7. CONCLUSION AND FUTURE SCOPE 82 – 83
8. REFERENCES 84 – 86
1. INTRODUCTION
The analysis of gene expressions has been an important topic in Molecular Biology. A gene
expression is the process by which a functional gene product (e.g. protein) is synthesized using
information encoded in a gene. Microarrays are high-throughput, high-sensitivity tools that
allow the measurement of thousands of genes simultaneously. The conduct of microarray
experiments is expensive. Therefore, measurements often exist only for very few samples. This
leads to data analysis problems in very high-dimensional feature spaces with relatively few samples, which
are an interesting challenge and have been subject to multiple machine learning approaches.
The CORUM database contains manually annotated protein complexes from mammalian
organisms and is maintained by the Munich Information Center for Protein Sequences (MIPS) at
the Helmholtz Zentrum Munich. Its annotations for protein-complexes include protein-complex
function, localization, subunit composition as well as literature references and more.
We propose a new data integration method using the protein-complex information from the
CORUM database to compute virtual protein- complex expressions. While the study of multi-
class problems is less frequent, we believe that support vector machines with recursive feature
elimination (SVM-RFE) in combination with our new data integration method offer a potent
solution.
To verify our approach, we have trained a multi-class support vector machine with recursive
feature elimination on a large breast-cancer dataset and have determined its performance on a
smaller breast-cancer data set from a different study. Even though we were unable to increase
classification performance on the computed virtual protein-complex expressions over gene-
expressions, our novel data integration method coupled with multi-class SVM-RFE has extracted
biologically interesting and interpretable protein complexes. Therefore, our new approach can be
seen as an interesting method to analyse and interpret data sets of this kind. Furthermore, this
can lead to new insights into the role and function of protein-complexes by using virtual protein-
complex expressions from gene expression microarray data sets.
1.1. BACKGROUND
The first chapter will be a short introduction to the relevant background. It will start with
molecular biology, continued by Support Vector Machines (SVMs) and possible derived
approaches to solve multi-class problems. It will end with a brief introduction to breast cancer.
Figure: Diagram of the information flow between DNA, RNA and Protein according to the central
dogma of molecular biology. Black arrows denote flow of the general group and red arrows
denote flow of the special group.
“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of
sequential information. It states that such information cannot be transferred from protein to either
protein or nucleic acid”, and elaborates this idea throughout the article. The central dogma of
molecular biology deals with information transfer, which can be split into three groups:
general, special and unknown. The general group, which occurs in all cells with minor exceptions,
consists of the following three transfers:
• DNA → DNA,
• DNA → RNA,
• RNA → Protein.
The special group does not occur in most cells, but has been shown to be present in virus-
infected cells, with flows from RNA → RNA and RNA → DNA.
A third case for the special group, the DNA → Protein flow, has only been seen in cell-free systems
containing neomycin. The other information flows are postulated by the central dogma to never
occur; these are Protein → Protein, Protein → DNA and Protein → RNA, and belong to the
unknown group (Crick, 1970).
Genes:
We divide all cells into two basic types: prokaryotes and eukaryotes. The defining structure that
sets eukaryotic cells apart from prokaryotic cells is the nucleus. Prokaryotic cells have cell walls
containing glycopeptides, whereas eukaryotic organisms are made from cells that are organized
by complex structures within membranes (Okafur, 2007, p.17). Human beings are eukaryotes
and contain billions of individual cells.
Almost all of these cells contain, within each nucleus, the complete hereditary information for
the organism in form of the genome. The genome consists of DNA and can be seen as the
blueprint for all structures of the organism. The human genome is made from 23 pairs of
chromosomes, where each pair is based on the chromosome pairs from their biological parents.
The chromosomes contain chains of DNA, which consist of two polymers. These polymers are
wrapped around each other and form a structure known as the double helix. The polymer strands are
held together by hydrogen bonds. They are large molecules of repeating monomers, which are
called nucleotides. Each nucleotide is made from a deoxyribose sugar, a phosphate group and one
of the following four nitrogen bases: adenine, cytosine, guanine and thymine, usually represented
by their first letters A, C, G and T. Due to a property of the nitrogen bases (called complementary
base pairing) one can deduce from one strand of DNA the other complementary strand; in
particular, adenine can only form hydrogen bonds with thymine, and guanine with cytosine. The
sequence of these nucleotides in the double helix encodes the hereditary genetic information,
and genes can therefore be described by their ordered sequence of nitrogen bases. The length of these
sequences can be hundreds of thousands of bases. The sequences encode particular patterns, of
which the exact number in the human genome is unknown. The number of protein-coding genes is
estimated to be between 20,000 and 25,000 genes.
all cells, the DNA has to reside in each cell. Therefore, the DNA needs to be replicated to create
new cells, which is essential to multi-cell organisms. The so-called DNA replication is the first of
the general information flows described by the central dogma of molecular biology. To form a
protein from a protein- coding gene, the gene information has to flow from the gene to the
messenger RNA (mRNA) using a process called transcription, and from the mRNA to the final
protein through the so-called translation. These are the second and third information flows as
postulated by the central dogma.
DNA replication:
Because complete DNA is contained within each cell of the organism, it needs to be replicated
upon cell division (cytokinesis). In eukaryotes, the replication is triggered at the end of the
interphase. The interphase is followed by the separation of the chromosomes (mitosis), and the
immediate cytokinesis. During the DNA replication process the double helix is split into its
strands and each strand’s complementary pair is synthesized using an enzyme called DNA-
polymerase.
Transcription:
Pearson (2006) noted that the genes consist of regions that are regulatory as well as regions that
explicitly code for a protein. One of these regulatory regions is known as promoter, and is used
by the RNA-polymerase that drives the RNA synthesis. The RNA synthesis is similar to the
DNA synthesis, with the notable difference that only one strand is copied. Eukaryotic genes
consist of exons and introns, which are DNA regions within a gene. The difference between
exons and introns is that the final mRNA only represents the exons. During transcription, first
precursor mRNA (pre-mRNA) is transcribed from the DNA strand. (The reverse way is used in
biotechnology as well as by retroviruses, which include the HIV/AIDS virus, where a single-
stranded RNA is transcribed to a single-stranded DNA; Okafur, 2007, p.36.)
A process called splicing later removes the introns from the pre-mRNA to yield the final one
stranded mRNA, which only consists of exons. The mRNA leaves the nucleus via the nuclear
membrane.
Translation:
Outside of the nucleus the mRNA is used as a template for the synthesis of proteins, which is
called translation. The mRNA contains the nucleotide uracil (U) instead of thymine. The translation is done
by ribosomes, which are large complexes of proteins. These ribosomes read the genetic
information carried by the mRNA molecules in triplets of nucleotides, and combine any of the 20
amino acids in the human body into complex polypeptide chains through chemical reactions.
These triplets are called codons. The translation will start on the codon AUG and select
phenylalanine if it reads the codon UUU, or glycine on GGG (these are examples; a complete
list can be found in Okafur, 2007, p.38, Table 3.1); but if it reads the codon UAA, UAG or UGA
the translation will be terminated (Okafur, 2007, p.38). The resulting polypeptide chains then
form the protein.
The process, in which the information from a gene is used to synthesize a functional gene
product or protein via transcription and translation, is known as gene expression.
Proteins:
Proteins are macromolecules; they form the building blocks of the organism and are responsible
for numerous functions inside the living organism. Proteins are important for the metabolism,
which is the maintaining of life in living things through chemical reactions. As hormones,
proteins transport messages through the body. For example, the protein hemoglobin transports
oxygen. Proteins are also the basis for enzymes, catalysts for chemical reactions. Proteins form
3-dimensional structures by folding their amino acid backbone. The protein’s shape depends on the
process it guides (Coen, 1999).
The linear chain of amino acids resulting from the translation phase is a so-called random-coil, a
simple polypeptide. To form a proper protein the random-coil needs to fold into a well-defined
three-dimensional structure, which defines its characteristics and function. This process is driven by
the interaction of the amino acids from the random coil during and after the protein synthesis. The
correct fold is essential for proteins to function correctly, and failure results in inactive proteins
or toxins. Inside the human body, there are many different types of proteins. When several
proteins with different polypeptide chains form a complex where each polypeptide chain
contains different protein domains, the result can have multiple catalytic functions, and is called
a multi-protein complex or protein-complex.
Multiprotein Complexes:
Today, we know that the cell’s dry mass consists mostly of proteins. These protein molecules
form protein assemblies that carry out most of the major processes in the cell. These little
machines of ten or more protein molecules perform complex biological functions where each
assembly interacts with other protein-complexes (assemblies). These functions include cell cycle,
protein degradation and protein folding (Alberts, 1998).
A similar definition is given by Ruepp et al. (2007, 2010) in CORUM: the Comprehensive
Resource of Mammalian Protein Complexes with a slightly stronger emphasis on gene
dependence:
“Protein complexes are key molecular entities that integrate multiple gene products to perform
cellular functions.”
The field of proteomics, the large-scale study of proteins, can be divided into cell-map and
expression proteomics. The large-scale, quantitative study of protein-protein interactions through
their isolation of protein complexes is called cell-map proteomics. It studies in particular the
structure and function of the proteins contained within the protein-complexes. The study of protein
expression changes is called expression proteomics. The availability of complete sequences of the
genome has shifted the focus towards functional interpretation of genomics (Blackstock and
Weir, 1999). This has led to the creation of large-scale databases containing protein-complexes,
subunits and functional descriptions. One of the largest, freely accessible databases is the
CORUM database, maintained as part of the Munich Information Center for Protein Sequences
(MIPS).
In the present study, a method consisting of four main phases is proposed for the diagnosis of
breast cancer, using SVM for feature extraction.
2. REQUIREMENT ANALYSIS
2.1. HARDWARE AND SOFTWARE REQUIREMENTS
Operating System: Windows 10
Python
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released
in 2000, introduced features like list comprehensions and a garbage collection system capable of
collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language
that is not completely backward-compatible, and much Python 2 code does not run unmodified
on Python 3.
The Python 2 language, i.e. Python 2.7.x, was officially discontinued on 1 January 2020 (first
planned for 2015), after which no further security patches or other improvements are released for
it. With Python 2's end of life, only Python 3.5.x and later are supported.
Python is an easy to learn, powerful programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant
syntax and dynamic typing, together with its interpreted nature, make it an ideal language for
scripting and rapid application development in many areas on most platforms.
The Python interpreter is easily extended with new functions and data types implemented in C or
C++ (or other languages callable from C). Python is also suitable as an extension language for
customizable applications.
HISTORY:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired
by SETL), capable of exception handling and interfacing with the Amoeba operating system. Its
implementation began in December 1989.[36] Van Rossum shouldered sole responsibility for the
project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation"
from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community
bestowed upon him to reflect his long-term commitment as the project's chief decision-maker.
[37]
He now shares his leadership as a member of a five-person steering council. In January 2019,
active Python core developers elected Brett Cannon, Nick Coghlan, Barry Warsaw, Carol
Willing and Van Rossum to a five-member "Steering Council" to lead the project.[41]
Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-
detecting garbage collector and support for Unicode.
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not
completely backward-compatible.[43] Many of its major features were backported to Python
2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at
least partially) the translation of Python 2 code to Python 3.
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that
a large body of existing code could not easily be forward-ported to Python 3.
Python Libraries
After modules and Python packages, we shift our discussion to Python libraries. In this section,
we discuss the Python Standard Library and different libraries offered by the Python
programming language: pandas, Matplotlib, SciPy, NumPy, etc.
We know that a module is a file with some Python code, and a package is a directory for sub
packages and modules. But the line between a package and a Python library is quite blurred.
A Python library is a reusable chunk of code that you may want to include in your programs/
projects. Compared to languages like C++ or C, Python libraries do not pertain to any specific
context in Python. Here, a ‘library’ loosely describes a collection of core modules. Essentially,
then, a library is a collection of modules. A package is a library that can be installed using a
package manager such as pip.
The Python Standard Library is a collection of the exact syntax, tokens, and semantics of Python. It
comes bundled with the core Python distribution. We mentioned this when we began with an
introduction.
It is written in C, and handles functionality like I/O and other core modules. All this functionality
together makes Python the language it is. More than 200 core modules sit at the heart of the
standard library. This library ships with Python. But in addition to this library, you can also
access a growing collection of several thousand components from the Python Package Index
(PyPI).
Matplotlib
Matplotlib is a numerical plotting library that helps with data analysis. We talked about it
in Python for Data Science.
Matplotlib is a widely used Python-based library; it is used to create 2D plots and graphs easily
through Python scripts, and its plotting interface is known as pyplot. By using pyplot, we can
create plots easily and control font properties, line styles, axis formatting, etc. It also provides a
wide variety of plots and graphs such as bar charts, histograms, power spectra, error charts, etc.
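For example, the following is a minimal pyplot sketch (the sample values are made up purely for illustration) that draws a line plot and a histogram:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # hypothetical sample values

plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)           # left panel: a simple 2D line plot
plt.plot(x, y, marker="o")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Line plot")
plt.subplot(1, 2, 2)           # right panel: a histogram
plt.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)
plt.title("Histogram")
plt.tight_layout()
plt.show()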
Pandas
In 2008, developer Wes McKinney started developing pandas when he needed a high-
performance, flexible tool for the analysis of data.
Prior to Pandas, Python was majorly used for data munging and preparation. It had very little
contribution towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of
data — load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc.
Key features of Pandas include:
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
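As a small sketch of the load, prepare, manipulate and analyze steps mentioned above (the file name and column names here are hypothetical, not part of this project's actual data):

import pandas as pd

# Load: read a (hypothetical) CSV file into a DataFrame.
df = pd.read_csv("breast_cancer_records.csv")

# Prepare: inspect the data and drop rows with missing values.
print(df.head())
df = df.dropna()

# Manipulate: filter rows and derive a new column (column names are assumed).
malignant = df[df["diagnosis"] == "M"]
df["area_ratio"] = df["area_mean"] / df["area_worst"]

# Analyze: summary statistics grouped by diagnosis.
print(df.groupby("diagnosis")["radius_mean"].describe())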
Numpy
Python is increasingly being used as a scientific language. Matrix and vector manipulations are
extremely important for scientific computations. Both NumPy and Pandas have emerged to be
essential libraries for any scientific computation, including machine learning, in python due to
their intuitive syntax and high-performance matrix computation capabilities.
NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open source module of
Python which provides fast mathematical computation on arrays and matrices. Since arrays and
matrices are an essential part of the machine learning ecosystem, NumPy, along with machine
learning modules like Scikit-learn, Pandas, Matplotlib, TensorFlow, etc., completes the Python
machine learning ecosystem.
>>> import numpy as np
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements of the
same type (homogeneous), i.e. integers, strings or characters, usually integers. In NumPy,
dimensions are called axes. The number of axes is called the rank.
There are several ways to create an array in NumPy, like np.array, np.zeros, np.ones, etc. Each of
them provides some flexibility.
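A short sketch of these array-creation routines and of basic array attributes (the values are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D array from a nested list
z = np.zeros((2, 3))                    # 2 x 3 array of zeros
o = np.ones(3)                          # 1-D array of ones

print(a.ndim)    # number of axes (the rank): 2
print(a.shape)   # (2, 3)
print(a.dtype)   # homogeneous element type, e.g. int64
print(a * 2)     # element-wise arithmetic
print(a @ a.T)   # matrix product: (2x3) @ (3x2) -> (2x2)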
3. REVIEW OF LITERATURE
EPIDEMIOLOGY
Breast cancer is the most commonly occurring cancer in women, comprising almost one third of
all malignancies in females. It is second only to lung cancer as a cause of cancer mortality, and it
is the leading cause of death for American women between the ages of 40 and 55.1 The lifetime
risk of a woman developing invasive breast cancer is 12.6%;2 one out of 8 females in the United
States will develop breast cancer at some point in her life.2 The death rate for breast cancer has
been slowly declining over the past decade, and the incidence has remained level since 1988
after increasing steadily for nearly 50 years.3 Twenty-five percent to 30% of women with
invasive breast cancer will die of their disease.1 But this statistic, as grim as it is, also means that
70% to 75% of women with invasive breast cancer will die of something other than their breast
cancer. Hence, a diagnosis of breast cancer, even invasive breast cancer, is not necessarily the
‘‘sentence of death’’ that many women (and their insurance companies) imagine. Mortality rates
are highest in the very young (less than age 35) and in the very old (greater than age 75).4 It
appears that the very young have more aggressive disease, and that the very old may not be
treated aggressively or may have comorbid disease that increases breast cancer fatality.5
Although 60% to 80% of recurrences occur in the first 3 years, the chance of recurrence exists
for up to 20 years.
Ninety-five percent of breast cancers are carcinomas, i.e., they arise from breast epithelial
elements. Breast cancers are divided into 2 major types, in situ carcinomas and invasive (or
infiltrating) carcinomas. The in situ carcinomas may arise in either ductal or lobular epithelium,
but remain confined there, with no invasion of the underlying basement membrane that would
constitute extension beyond epithelial boundaries. As would be expected with such localized and
confined malignancy, there is negligible potential for metastases. When there is extension of the
ductal or lobular malignancy beyond the basement membrane that constitutes the epithelial
border, then the malignancy is considered invasive (or infiltrating) ductal or lobular carcinoma.
The potential for metastases and ultimately death occurs in invasive disease.
Breast cancer incidence is highest in North America and Northern Europe and lowest in Asia and
Africa. Studies of migration patterns to the United States suggest that genetic factors alone do
not account for the incidence variation among countries, as the incidence rates of second-, third-
and fourth-generation Asian immigrants increase steadily in this country. Thus, environmental
and/or lifestyle factors appear to be important determinants of breast cancer risk.5 Gender is by
far the greatest risk factor. Breast cancer occurs 100 times more frequently in women than men.
In women, incidence rates of breast cancer rise sharply with age (see Table 1) until ages 45 to 50,
when the rise becomes less steep.4 This change in slope probably reflects the impact of hormonal
change (menopause) that occurs about this time. By ages 75 to 80, the curve actually flattens and
then decreases. Despite the steepness of the incidence curve at younger ages, the more important
issue is the increasing prevalence of breast cancer with advancing age, and the take-home
message for physicians and underwriters alike is that any breast mass in a postmenopausal
woman should be considered cancer until proven otherwise.8 Genetics plays a limited but
important role as a risk factor for breast cancer.
Only 5% to 6% of breast cancers are considered hereditary.9 BRCA-1 and BRCA-2 account for
an estimated 80% of hereditary breast cancer, but again, this only represents 5% to 6% of all
breast cancers. BRCA-1 and/or BRCA-2 positive women have a 50% to 85% lifetime risk of
developing breast cancer (see Table 1), and a 15% to 65% risk of developing ovarian cancer,
beginning at age 25.10 Familial breast cancer is considered a risk if a first-degree relative
develops breast cancer before menopause, if it affected both breasts, or if it occurred in
conjunction with ovarian cancer.
11 There is a 2-fold relative risk of breast cancer if a woman has a single first-degree relative
(mother, sister or daughter). There is a 5-fold increased risk if 2 first-degree relatives have had
breast cancer.12 A woman’s hormonal history appears to be a risk factor, as the relative risk of
breast cancer seems to be related to the breast’s cumulative exposure to estrogen and
progesterone.
Early menarche (onset of menstruation before age 13), having no children or having them after age
30, and menopause after age 50 and especially age 55: all these mean more menstrual cycles
and thus greater hormone exposure.13 The Women’s Health Initiative (WHI), a randomized
controlled trial of 16,608 postmenopausal women comparing effects of estrogen plus progestin
with placebo on chronic disease risk, confirmed that combined estrogen plus progestin use
increases the risk of invasive breast cancer.14 Hormone replacement therapy (HRT) users have a
breast cancer risk that is 53% higher for combination therapy and 34% higher for estrogen alone,
especially if used for more than 5 years.
Although earlier studies suggested that this increased risk of cancer was offset by the fact that the
cancers induced by HRT were of more benign pathology and had a more favorable prognosis,4
reevaluation of the WHI data reveals this impression to be incorrect. Invasive breast cancers
associated with estrogen plus progestin use were larger (1.7 cm vs 1.5 cm, p = 0.04), were more
likely to be node positive (26% vs 16%, p = 0.03), and were diagnosed at a significantly more
advanced stage (regional/metastatic 25.4% vs 16%, p = 0.04). The percentages and distribution
of invasive ductal, invasive lobular, mixed ductal, and lobular as well as tubular carcinomas were
similar in the estrogen plus progestin group vs the placebo group.15 Over observation time as
short as a year, there was a statistically significant increase in breast density in the estrogen plus
progestin group, resulting in increased incidence of abnormal mammograms (9.4% vs 5.4%,
p < 0.001).
As noted by Gann and Morrow in a JAMA editorial, ‘‘the ability of combined hormone therapy
to decrease mammographic sensitivity creates an almost unique situation in which an agent
increases the risk of developing a disease while simultaneously delaying its detection.’’16 Li et
al reported that women using unopposed estrogen replacement therapy (ERT) had no appreciable
increase in the risk of breast cancer. However, use of combined estrogen and progestin hormone
replacement therapy had an overall 1.7-fold (95% CI 1.3– 2.2) increased risk of breast cancer,
including a 2.7-fold (95% CI 1.7–4.3) increased risk of invasive lobular carcinoma, a 1.5-fold
(95% CI, 1.1–2.0) increased risk of invasive ductal carcinoma, and a 2-fold (95% CI 1.5–2.7)
increased risk of ER+/PR+ breast cancers.17 Other risk factors for breast cancer include alcohol,
which has been linked to increased blood levels of estrogen interfering with folate metabolism
that protects against tumor growth. Women who drink more than 2 ounces of alcohol per day are 40%
more likely to develop breast cancer than women who drink no alcohol.
This is an issue of great concern for patients, physicians and insurance companies alike, as there
are conditions that confer no risk of malignancy and others that definitely confer increased risk.
Breast biopsies conferring no significantly increased risk for malignancy include any lesion with
non-proliferative change.25,26 These include duct ectasia and simple fibroadenomas, benign
solid tumors containing glandular as well as fibrous tissue. The latter is usually single but may be
multiple. Solitary papillomas are also benign lesions conferring no increased risk of future
malignancy, despite the fact that they are often (in 21 of 24 women in a single study27) associated
with sanguineous or serosanguineous nipple discharge. Fibrocystic change (cysts and/or fibrous tissue
without symptoms) or fibrocystic disease (fibrocystic changes occurring in conjunction with
pain, nipple discharge, or a degree of lumpiness sufficient to cause suspicion of cancer) does not
carry increased risk for cancer (other than the potential for missing a malignant mass).
Some clinicians differentiate fibrocystic change or disease into those of hyperplasia, adenosis,
and cystic change because of their differentiation into age distributions. Hyperplasia
characteristically occurs in women in their 20s, often with upper outer quadrant breast pain and
an indurated axillary tail, as a result of stromal proliferation. Women in their 30s present with
solitary or multiple breast nodules 2–10 mm in size, as a result of proliferation of glandular cells.
Women in their 30s and 40s present with solitary or multiple cysts. Acute enlargement of cysts
may cause pain, and because breast ducts are usually patent, nipple discharge is common with
the discharge varying in color from pale green to brown.29 Conditions with increased risk of
malignancy include ductal hyperplasia without atypia. This is the most commonly encountered
breast biopsy result that is definitely associated with increased risk of future development of
breast cancer and confers a 2-fold increased risk.
The number, size and shape of epithelial cells lining the basement membrane of ducts are
increased, but the histology does not fulfill criteria for malignancy. The loss of expression of
transforming growth factor-b receptor II in the affected epithelial cells is associated with an
increased risk of invasive breast cancer.30 A number of other benign lesions also confer a
roughly 2-fold increased risk for development of breast cancer. These include sclerosing
adenosis, where lobular tissue undergoes hyperplastic change with increased fibrous tissue and
interspaced glandular cells, diffuse papillomatosis which is the formation of multiple papillomas,
and fibroadenomas with proliferative disease, which are tumors that contain cysts greater than 3
mm in diameter, with sclerosing adenosis, epithelial calcification, or papillary apocrine change.
Radial scars are benign breast lesions of uncertain pathogenesis, which are usually discovered
incidentally when a breast mass is removed for other reasons. Radial scars are characterized by a
fibroelastic core from which ducts and lobules radiate.31
As breast cancer rarely causes pain, a painless mass is much more worrisome for malignancy
than is one causing symptoms. Mammography done yearly beginning at age 40 is the current
recommendation for women with no risk factors.35 The most commonly encountered
categorization of mammography findings is summarized in Table 2. Although mammograms
may detect malignancy as small as 0.5 cm, 10% to 20% of malignancies elude detection by
mammography, even when they occur at a much larger size.36 In a patient with a solid,
dominant mass (suspicious mass) the primary purpose of the mammogram is to screen the
normal surrounding breast tissue and the opposite breast for nonpalpable cancers, not to make a
diagnosis of the palpable mass.8 Thus, a negative mammogram is no guarantee of absence of
malignancy, and a mass that does not disappear or collapse with aspiration must be assumed to
be a malignancy and biopsied.
THE BIOPSY
There are 3 methods of obtaining material from a suspicious breast lump. Fine-
needle aspiration is not a reliable means of diagnosis, because it cannot distinguish ductal
carcinoma in situ from invasive cancer and it may lead to a false-negative result.1 Fine needle
aspiration (FNA) is generally reserved for palpable cyst-like lumps visible on a mammogram or
ultrasound. False positives are negligible but false-negative results occur in 15% to 20%, leading
to the recommendation that if the cyst or lump doesn’t disappear with FNA, further biopsy is
mandatory.8 Core needle biopsy has generally replaced fine needle aspiration in all but obvious
cysts. Core needle biopsies fail to identify areas of invasion in approximately 20% of cases
which are originally diagnosed as ductal carcinoma in situ. Atypical ductal hyperplasia in a core
needle biopsy has a relatively high incidence of coexistent carcinoma (approximately 50%). This
diagnosis, therefore, demands excisional biopsy.
Intraductal (or ductal) carcinoma in situ (DCIS) is the proliferation of malignant epithelial cells
confined to ducts, with no evidence of invasion through the basement membrane. Prior to
mammography, DCIS was an uncommon diagnosis. With the introduction of routine
mammography, the age-adjusted incidence of DCIS rose from 2.3 to 15.8 per 100,000 females, a
587% increase. New cases of invasive breast cancer increased 34% over the same time period.48
About 85% of all intraductal cancers, often less than 1 cm, are discovered by the appearance of
clustered microcalcifications on mammography. Other conditions, including sclerosing adenosis
and atypical ductal hyperplasia, may also present on mammography with microcalcifications.
Morphology of the microcalcifications is the most important factor in differentiating benign from
malignant calcification.
change, fibroadenoma, or a mass suspected as being cancer.74 LCIS is more often detected in
premenopausal than postmenopausal women, suggesting a hormonal influence in the
development or maintenance of these lesions.75,76 LCIS requires no specific therapy per se.
Although the cells of LCIS are in fact small, well-differentiated neoplastic cells, they do not
behave as a true malignant neoplasm in that these cells may distend and distort the terminal-
lobular units, but invasion of and through the basement membrane does not occur, so the lesion
never results in invasive breast malignancy.
At initial diagnosis, over 50% of breast cancers are stages 0 or I,78 and 75% are Stage 0, I, or II.
(Table 4)79 The quantity of lymph node involvement has a profound impact on survival. Stage
IIA cancer (T0-T1, N1) with only 1 involved lymph node has a 10-year disease-free survival of
71% and a 20-year disease-free survival of 66%. If 2 to 4 lymph nodes are involved, the 10-year
disease-free survival is 62% and the 20-year disease-free survival is 56%.
The Consensus Development Conference on the Treatment of Early-Stage Breast Cancer (June
1990, NCI) has concluded that breast conservation treatment is an appropriate method of primary
therapy for the majority of women with Stage I and Stage II breast cancers. This treatment is
preferable in many cases because it provides survival equivalent to total mastectomy and axillary
dissection while preserving the breast.80 Subsequent studies have confirmed that there is no
difference in long-term survival between surgical removal of the breast (mastectomy) and
excision of the tumor mass and radiation therapy to residual breast tissue (breast conservation
therapy).81–83 Breast-conserving surgery includes lumpectomy, re-excision, partial
mastectomy, quadrantectomy, segmental excision, and wide excision.
Axillary lymph nodes are removed for evaluation through a separate incision. The most
common breast-removal procedure is a modified-radical mastectomy, which involves making an
elliptical incision around an area including the nipple and biopsy scar, removing that section, and
tunneling under the remaining skin to remove the breast tissue and some lymph nodes. Radical
mastectomy, which removes the entire breast, chest wall muscles, and all axillary lymph nodes,
is rarely done today because it offers no survival advantage over a modified radical mastectomy.
A simple, or total mastectomy, removes the entire breast but none of the axillary lymph nodes.
This is usually done for women with DCIS, or prophylactically for women at especially high risk
for developing breast cancer. A newer procedure is the skin-sparing mastectomy, which involves
removing the breast tissue through a circular incision around the nipple and replacing the breast
with fat taken from the abdomen or back.
One trial randomly assigned breast cancer survivors to either a specialist or a family physician,
and found no differences between the 2 groups in measured outcomes, including time to
diagnosis of recurrence, anxiety, or health-related quality of life.92 A subsequent economic
analysis of this study found the quality of life as measured by frequency and length of patient
visits and costs were better when follow-up was provided by the family physician as compared to
the specialist.93 Routine history and physical examination and regularly scheduled
mammograms are the mainstay of care for the breast cancer survivor.94 Recurrence of breast
cancer is more frequently discovered by the patient (71%) than by her physician (15%).6 Women
should be encouraged to perform breast self-examination monthly.
Mammograms should be done at 6 and 12 months after surgery and then yearly thereafter.
Several tumor-associated antigens, including CA 15-3 and CEA, may detect breast cancer
recurrence, but not with sufficient sensitivity and specificity to be routinely used by either
clinicians95 or insurance underwriters. A newer marker, CA 27.29, showed promise in one well-
designed study of 166 women with stage II and III breast cancer. The sensitivity and specificity
of this test were 58% and 98%, respectively. Recurrence was detected approximately 5 months
earlier than with routine surveillance.96 However, improvement in survival or quality of life
using this marker has not yet been proven.
Neither routine chest x-rays nor serial radionucleotide bone scans have been found to be useful
in detecting metastatic disease in asymptomatic women.
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. New
examples are then mapped into that same space and predicted to belong to a category based on
the side of the gap on which they fall.
When data are unlabeled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to groups, and
then map new data to these formed groups.
Maximal-Margin Classifier
The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in
practice.
The numeric input variables (x) in your data (the columns) form an n-dimensional space. For
example, if you had two input variables, this would form a two-dimensional space.
A hyper plane is a line that splits the input variable space. In SVM, a hyper plane is selected to
best separate the points in the input variable space by their class, either class 0 or class 1. In two-
dimensions you can visualize this as a line, and let us assume that all of our input points can be
completely separated by this line. For example:
B0 + (B1 * X1) + (B2 * X2) = 0
where the coefficients (B1 and B2) that determine the slope of the line and the intercept (B0) are
found by the learning algorithm, and X1 and X2 are the two input variables.
You can make classifications using this line. By plugging in input values into the line equation,
you can calculate whether a new point is above or below the line.
Above the line, the equation returns a value greater than 0 and the point belongs to the
first class (class 0).
Below the line, the equation returns a value less than 0 and the point belongs to the
second class (class 1).
A value close to the line returns a value close to zero and the point may be difficult to
classify.
If the magnitude of the value is large, the model may have more confidence in the
prediction.
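A tiny sketch of this rule, using hypothetical coefficient values (B0, B1, B2) rather than values learned from data:

# Classify a point by which side of the line B0 + (B1 * X1) + (B2 * X2) = 0 it falls on.
B0, B1, B2 = -3.0, 1.0, 0.5   # hypothetical coefficients

def classify(x1, x2):
    value = B0 + B1 * x1 + B2 * x2
    label = 0 if value > 0 else 1   # class 0 above the line, class 1 below
    return value, label

for point in [(4.0, 2.0), (1.0, 1.0), (2.0, 2.0)]:
    value, label = classify(*point)
    print(point, "->", round(value, 2), "class", label)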
The distance between the line and the closest data points is referred to as the margin. The best or
optimal line that can separate the two classes is the line that has the largest margin. This is called
the Maximal-Margin hyper plane.
The margin is calculated as the perpendicular distance from the line to only the closest points.
Only these points are relevant in defining the line and in the construction of the classifier. These
points are called the support vectors. They support or define the hyper plane.
The hyper plane is learned from training data using an optimization procedure that maximizes
the margin.
In practice, real data is messy and cannot be separated perfectly with a hyper plane.
The constraint of maximizing the margin of the line that separates the classes must be relaxed.
This is often called the soft margin classifier. This change allows some points in the training data
to violate the separating line.
An additional set of coefficients is introduced that gives the margin wiggle room in each
dimension. These coefficients are sometimes called slack variables. This increases the
complexity of the model, as there are more parameters for the model to fit to the data.
A tuning parameter is introduced, called simply C, that defines the magnitude of the wiggle
allowed across all dimensions. The C parameter defines the amount of violation of the margin
allowed. C=0 permits no violation, and we are back to the inflexible Maximal-Margin Classifier
described above. The larger the value of C, the more violations of the hyper plane are permitted.
During the learning of the hyper plane from data, all training instances that lie within the
distance of the margin will affect the placement of the hyper plane and are referred to as support
vectors. And as C affects the number of instances that are allowed to fall within the margin, C
influences the number of support vectors used by the model.
The smaller the value of C, the more sensitive the algorithm is to the training data (higher
variance and lower bias).
The larger the value of C, the less sensitive the algorithm is to the training data (lower
variance and higher bias).
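As a rough illustration of how the margin parameter influences the number of support vectors, the following scikit-learn sketch fits a linear SVM on synthetic data for several values of C. Note that scikit-learn's C is the penalty form described later in the Regularization section, so in this sketch a smaller C yields a softer margin and therefore more support vectors; the data and C values are illustrative only.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two slightly overlapping blobs as a toy binary problem.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(model.support_vectors_)} support vectors")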
The learning of the hyper plane in linear SVM is done by transforming the problem using some
linear algebra, which is out of the scope of this introduction to SVM.
A powerful insight is that the linear SVM can be rephrased using the inner product of any two
given observations, rather than the observations themselves. The inner product between two
vectors is the sum of the multiplication of each pair of input values.
For example, the inner product of the vectors [2, 3] and [5, 6] is 2*5 + 3*6 or 28.
The equation for making a prediction for a new input, using the dot product between the input (x)
and each support vector (xi), can be written as:
f(x) = B0 + sum(ai * (x, xi))
This is an equation that involves calculating the inner products of a new input vector (x) with all
support vectors in the training data. The coefficients B0 and ai (for each input) must be estimated
from the training data by the learning algorithm.
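A toy numeric sketch of this prediction form, with hypothetical support vectors and coefficients (not learned from any real dataset):

import numpy as np

support_vectors = np.array([[2.0, 3.0],
                            [5.0, 6.0],
                            [1.0, 0.5]])
a = np.array([0.8, -0.5, 0.3])   # one coefficient ai per support vector (hypothetical)
B0 = 0.1                         # hypothetical intercept

def predict(x):
    # f(x) = B0 + sum_i ai * <x, xi>; the sign gives the predicted class.
    score = B0 + np.sum(a * (support_vectors @ x))
    return score, (1 if score >= 0 else -1)

print(predict(np.array([4.0, 1.0])))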
The kernel defines the similarity or a distance measure between new data and the support
vectors. The dot product is the similarity measure used for linear SVM or a linear kernel because
the distance is a linear combination of the inputs.
Other kernels can be used that transform the input space into higher dimensions such as a
Polynomial Kernel and a Radial Kernel. This is called the Kernel Trick.
It is desirable to use more complex kernels as it allows lines to separate the classes that are
curved or even more complex. This in turn can lead to more accurate classifiers.
The polynomial kernel can be written as K(x, xi) = 1 + sum(x * xi)^d, where the degree of the
polynomial must be specified by hand to the learning algorithm. When d=1 this is the same as the
linear kernel. The polynomial kernel allows for curved lines in the input space.
Finally, we can also have a more complex radial kernel, for example
K(x, xi) = exp(-gamma * sum((x − xi)^2)),
where gamma is a parameter that must be specified to the learning algorithm. A good default
value for gamma is 0.1, where gamma is often 0 < gamma < 1. The radial kernel is very local
and can create complex regions within the feature space, like closed polygons in a two-
dimensional space.
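A brief scikit-learn sketch of swapping in polynomial and radial (RBF) kernels on a synthetic non-linear problem; the dataset and parameter values are chosen only for illustration:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in the input space.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

models = {
    "linear":          SVC(kernel="linear"),
    "poly (d=3)":      SVC(kernel="poly", degree=3),
    "rbf (gamma=0.1)": SVC(kernel="rbf", gamma=0.1),
}
for name, model in models.items():
    print(name, "training accuracy:", round(model.fit(X, y).score(X, y), 3))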
You can use a numerical optimization procedure to search for the coefficients of the hyper plane.
This is inefficient and is not the approach used in widely used SVM implementations
like LIBSVM. If implementing the algorithm as an exercise, you could use stochastic gradient
descent.
There are specialized optimization procedures that re-formulate the optimization problem to be a
Quadratic Programming problem. The most popular method for fitting SVM is the Sequential
Minimal Optimization (SMO) method that is very efficient. It breaks the problem down into sub-
problems that can be solved analytically (by calculating) rather than numerically (by searching or
optimizing).
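As a sketch of the two routes just mentioned, one can either rely on an SMO-based solver (LIBSVM, which backs scikit-learn's SVC) or train a linear SVM by stochastic gradient descent on the hinge loss; the dataset here (scikit-learn's bundled breast cancer data) is used only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# SMO-based solver (LIBSVM) behind scikit-learn's SVC.
smo_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Linear SVM trained by stochastic gradient descent on the hinge loss.
sgd_svm = make_pipeline(StandardScaler(), SGDClassifier(loss="hinge", max_iter=2000))

for name, model in [("SMO / LIBSVM", smo_svm), ("SGD, hinge loss", sgd_svm)]:
    print(name, "training accuracy:", round(model.fit(X, y).score(X, y), 3))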
This section lists some suggestions for how to best prepare your training data when learning an
SVM model.
Numerical Inputs: SVM assumes that your inputs are numeric. If you have categorical
inputs you may need to convert them to binary dummy variables (one variable for each category),
as in the sketch after this list.
Binary Classification: Basic SVM as described here is intended for binary (two-
class) classification problems, although extensions have been developed for regression and
multi-class classification.
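A short sketch of converting a categorical input into binary dummy variables before fitting an SVM; the column names and values below are hypothetical:

import pandas as pd
from sklearn.svm import SVC

# Hypothetical records with one categorical and one numeric input.
df = pd.DataFrame({
    "menopause": ["pre", "post", "post", "pre"],   # categorical input
    "tumor_size": [12.0, 25.0, 18.0, 9.0],         # numeric input
    "label": [0, 1, 1, 0],
})

# One binary dummy variable per category, as suggested above.
X = pd.get_dummies(df[["menopause", "tumor_size"]], columns=["menopause"])
y = df["label"]

model = SVC(kernel="linear").fit(X, y)
print(X.columns.tolist())
print(model.predict(X))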
Support Vector Machines (SVMs) are large margin classifiers. The margin can be understood as
the distance of the example to the separation boundary. The large margin classifier generates a
decision boundary with a large margin to almost all training examples. SVMs have been
introduced by Boser et al. (1992), and can be used for binary classification problems. SVMs are
not machines in the classical sense, and do not consist of any tangible parts; they are merely
mathematical algorithms for pattern matching. To understand the mathematical model, a few
definitions are needed.
Notation 1 (Scalar Product). Let Vn be an n-dimensional vector space, for x, y ∈ Vn the scalar
product will be denoted by:
⟨x, y⟩ ∶= ∑i xi ⋅ yi .
Notation 2 (Euclidean Norm). Let x ∈ Rn, the euclidean norm will be written as follows:
∣∣x∣∣ ∶=√⟨x, x⟩.
A hyper plane is an (n−1)-dimensional object, in the same sense that in a three-dimensional space
a two-dimensional object can be seen as a plane. The formal definition follows:
Definition (Hyper plane). A hyper plane h ⊂ Rn can be explicitly defined by its perpendicular
vector w ∈ Rn and its distance b ∈ R from the origin:
h ∶= {x ∈ Rn ∶ ⟨x,w⟩ = b}.
If b = 0 the hyper plane is said to be unbiased. The distance of the hyper plane to the origin is
b/∣∣w∣∣ units. A decision function f ∶ Rn → {−1, 1} can then be derived from the hyper plane:
f(x) = sign(⟨x, w⟩ − b).
The hyper plane is not unique, as figure 2.4a shows; but only one hyper plane has the broadest
margin. The margin describes the shortest distance from any point in S to h, as depicted in figure
2.4b. This leads to the formulation of the Support Vector Machine:
Definition (Support Vector Machine). The support vector machine finds the hyper plane h with the
broadest margin for a linearly separable binary classification problem as the solution to
min_{w,b} 1/2 ∣∣w∣∣²  subject to  ∀(xi, yi) ∈ S ∶ yi(⟨xi, w⟩ − b) ≥ 1.
This optimization problem is known as the primal form of the SVM. It can also be solved in its
dual form.
Theorem (Dual Form of the Support Vector Machine). The support vector machine can also be
solved in its dual form, and the optimal solutions of the primal and dual form coincide.
Proof. Using the generalized Lagrangian, the primal form can be written as
min_{w,b} max_{α≥0} L(w, b, α), with L(w, b, α) = 1/2 ∣∣w∣∣² − ∑i αi (yi(⟨xi, w⟩ − b) − 1).
Because the optimization problem is convex and Slater’s condition is satisfied, strong duality
holds: max_{α≥0} min_{w,b} L(w, b, α) = min_{w,b} max_{α≥0} L(w, b, α).
This implies that the optimal solutions of the primal and dual form coincide. From the optimal
solution α∗ of the dual form, the primal variables w∗ and b∗ can be computed. From the Karush-
Kuhn-Tucker (KKT) conditions, the complementary slackness
∀i ∶ α∗i (1 − yi(⟨xi, w∗⟩ − b∗)) = 0
is used to compute b∗. It implies that for all i with α∗i > 0 the point xi lies on the margin:
yi(⟨xi, w∗⟩ − b∗) = 1. This can also be seen in figure 2.5b. The primal variables are recovered as
w∗ = ∑i α∗i yi xi,   b∗ = 1/2 ( max_{i∶ yi=−1, α∗i>0} ⟨w∗, xi⟩ + min_{i∶ yi=1, α∗i>0} ⟨w∗, xi⟩ ).
For problems that are not linearly separable, a penalty parameter C ∈ R+ as well as slack variables
ζi ∈ R+ for each point in S are introduced; ζ denotes the vector (ζ1, ζ2, ...)ᵗ. (Note on notation: if a
variable x ∈ R is used where a vector is expected, x implicitly represents the vector (x, x, ..., x)ᵗ of
the required dimension.) The slack variables are used to soften the constraints of the optimization
problem, while the penalty parameter allows adjusting the impact of the slack variables on the
objective function.
The blue circles in the plot represent females and the green squares represent males. A few expected
insights from the graph are:
If we were to see an individual with height 180 cms and hair length 4 cms, our best guess will be
to classify this individual as a male. This is how we do a classification analysis.
Support vectors are simply the co-ordinates of individual observations. For instance, (45, 150) is a
support vector which corresponds to a female. The Support Vector Machine is a frontier which best
segregates the males from the females. In this case, the two classes are well separated from each
other, hence it is easier to find an SVM.
There are many possible frontiers which can classify the problem at hand. Following are three
possible frontiers.
How do we decide which is the best frontier for this particular problem statement?
The easiest way to interpret the objective function in an SVM is to find the minimum distance of
the frontier from the closest support vector (this can belong to either class). For instance, the orange
frontier is closest to the blue circles, and the closest blue circle is 2 units away from the frontier.
Once we have these distances for all the frontiers, we simply choose the frontier with the
maximum distance (from the closest support vector). Out of the three shown frontiers, we see the
black frontier is farthest from its nearest support vector (i.e. 15 units).
Our job of finding the SVM was relatively easy in this case. What if the distribution
looked something like the following?
In such cases, we do not see a straight-line frontier directly in the current plane which can serve as
the SVM. In such cases, we need to map these vectors to a higher-dimensional plane so that they get
segregated from each other. Such cases will be covered once we start with the formulation of the
SVM. For now, you can visualize that such a transformation will result in the following type of
SVM.
Each of the green squares in the original distribution is mapped onto a transformed scale, and the
transformed scale has clearly segregated the classes.
Support Vector Machines are a very powerful classification algorithm. When used in conjunction
with random forests and other machine learning tools, they give a very different dimension to
ensemble models. Hence, they become very crucial for cases where very high predictive power is
required. Such algorithms are slightly harder to visualize because of the complexity of their
formulation. You will find these algorithms very useful for solving some Kaggle problem
statements.
Suppose you are given a plot of two labelled classes on a graph, as shown in image (A). Can you decide
a separating line for the classes?
Image A: Draw a line that separates black circles and blue squares.
You might have come up with something similar to the following image (image B). It fairly
separates the two classes. Any point to the left of the line falls into the black circle class and any point
to the right falls into the blue square class. Separation of classes: that is what SVM does. It finds out a line/
hyper-plane (in multidimensional space) that separates out the classes. Shortly, we shall discuss why
I wrote multidimensional space.
So far so good. Now consider what happens if we had data as shown in the image below. Clearly, there is no
line that can separate the two classes in this x-y plane. So what do we do? We apply a
transformation and add one more dimension, which we call the z-axis. Let us assume the value of points on the
z-axis is given by z = x² + y². In this case we can interpret it as the distance of a point from the origin. Now if we
plot on the z-axis, a clear separation is visible and a line can be drawn.
When we transform this line back to the original plane, it maps to a circular boundary as shown in
image E. These transformations are called kernels.
Thankfully, you don’t have to guess or derive the transformation every time for your data set; the
sklearn library's SVM implementation provides it built in.
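A small sketch of this idea on synthetic data: adding a third feature z = x² + y² makes concentric-circle data linearly separable, which is what the RBF kernel achieves implicitly:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# Explicit transformation: add z = x^2 + y^2 as a third dimension.
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3 = np.hstack([X, z])

linear_2d = SVC(kernel="linear").fit(X, y)    # no straight line separates the circles
linear_3d = SVC(kernel="linear").fit(X3, y)   # separable after the explicit lift
rbf_2d = SVC(kernel="rbf").fit(X, y)          # the kernel performs the lift implicitly

for name, model, data in [("linear, 2D", linear_2d, X),
                          ("linear, 3D (with z)", linear_3d, X3),
                          ("rbf, 2D", rbf_2d, X)]:
    print(name, "training accuracy:", round(model.score(data, y), 3))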
What if the data plot overlaps? Or, what if some of the black points are inside the blue ones?
Which line among 1 and 2 should we draw?
Image 1
Image 2
Which one do you think? Well, both the answers are correct. The first one tolerates some outlier
points. The second one is trying to achieve 0 tolerance with perfect partition.
But there is a trade-off. In real-world applications, finding a perfect class boundary for millions of
training samples takes a lot of time, as you will see in the coding exercise. This is handled by the
regularization parameter. In the next section, we define two terms, the regularization parameter and
gamma, which are tuning parameters of the SVM classifier. By varying those, we can achieve a
considerably non-linear classification line with more accuracy in a reasonable amount of time. In the
coding exercise (part 2 of this chapter) we shall see how we can increase the accuracy of an SVM by
tuning these parameters.
One more parameter is the kernel. It defines whether we want a linear or non-linear separation. This is
also discussed in the next section.
Kernel
The learning of the hyperplane in a linear SVM is done by transforming the problem using some linear algebra; this is where the kernel plays its role.
For the linear kernel, the prediction for a new input x is computed from the dot product between the input (x) and each support vector (xi) as follows:
f(x) = B0 + sum(ai * (x · xi))
This equation involves calculating the inner products of the new input vector (x) with all support vectors in the training data. The coefficients B0 and ai (one for each training input) must be estimated from the training data by the learning algorithm.
The polynomial kernel can be written as K(x, xi) = (1 + sum(x * xi))^d and the exponential (RBF) kernel as K(x, xi) = exp(-gamma * sum((x - xi)²)).
Polynomial and exponential kernels compute the separating boundary in a higher dimension; this is called the kernel trick.
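As a small illustration of how the kernel choice is made in practice, the sketch below (an assumed example using scikit-learn's SVC on a synthetic two-moons dataset, not the project data) fits the same points with the linear, polynomial and RBF kernels discussed above:
#Illustrative only: compare kernels on synthetic data
from sklearn.datasets import make_moons
from sklearn.svm import SVC
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel, degree=3, gamma='scale')   #degree is only used by the 'poly' kernel
    clf.fit(X, y)
    print(kernel, 'training accuracy:', clf.score(X, y))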
Regularization
The regularization parameter (often termed the C parameter in Python's sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that
hyperplane does a better job of getting all the training points classified correctly. Conversely, a
very small value of C will cause the optimizer to look for a larger-margin separating hyperplane,
even if that hyperplane misclassifies more points.
The images below (the same as image 1 and image 2 in section 2) are examples of two different regularization values. The left one shows some misclassification due to a lower regularization value, while a higher value leads to results like the one on the right.
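The following sketch (synthetic data, not the project's dataset) illustrates this behaviour by fitting a linear SVC with a small, a medium and a large value of C and reporting the training accuracy and the number of support vectors:
#Illustrative only: effect of the regularization parameter C
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.05, random_state=0)
for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print('C =', C, 'training accuracy =', round(clf.score(X, y), 3),
          'support vectors =', clf.n_support_.sum())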
Gamma
The gamma parameter defines how far the influence of a single training example reaches, with
low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma,
points far away from plausible seperation line are considered in calculation for the seperation
line. Where as high gamma means the points close to plausible line are considered in calculation.
High Gamma
Low Gamma
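A similar sketch (again on assumed synthetic data) for the gamma parameter of the RBF kernel shows the over-fitting effect of a large gamma: training accuracy keeps rising while test accuracy drops.
#Illustrative only: effect of gamma with the RBF kernel
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
for gamma in (0.1, 1, 100):
    clf = SVC(kernel='rbf', gamma=gamma).fit(X_tr, y_tr)
    print('gamma =', gamma,
          'train =', round(clf.score(X_tr, y_tr), 3),
          'test =', round(clf.score(X_te, y_te), 3))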
Margin
And finally last but very importrant characteristic of SVM classifier. SVM to core tries to
achieve a good margin.
A good margin is one where this separation is larger for both the classes. Images below gives to
visual example of good and bad margin. A good margin allows the points to be in their
respective classes without crossing to other class.
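For a trained linear SVM the margin can be read off the learned weight vector, since the distance between the two margin boundaries is 2 / ||w||. A minimal sketch on synthetic, well-separated data (an assumption, not the project's dataset):
#Illustrative only: compute the margin width of a fitted linear SVM
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel='linear', C=1000).fit(X, y)   #a large C approximates a hard margin
w = clf.coef_[0]
print('margin width:', 2 / np.linalg.norm(w))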
Background:
Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of
all cancer cases, and affected over 2.1 Million people in 2015 alone. It starts when cells in the
breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or
felt as lumps in the breast area.
Early diagnosis significantly increases the chances of survival. The key challenge in its detection is how to classify tumors as malignant (cancerous) or benign (non-cancerous). A
tumor is considered malignant if the cells can grow into surrounding tissues or spread to distant
areas of the body. A benign tumor does not invade nearby tissue nor spread to other parts of the
body the way cancerous tumors can. But benign tumors can be serious if they press on vital
structures such as blood vessels or nerves.
Machine learning techniques can dramatically improve the quality of breast cancer diagnosis. Research shows that experienced physicians can detect cancer with about 79% accuracy, while 91% (and sometimes up to 97%) accuracy can be achieved using machine learning techniques.
Project Task
In this study, my task is to classify tumors into malignant (cancerous) or benign (non-cancerous)
using features obtained from several cell images.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
They describe characteristics of the cell nuclei present in the image.
Attribute Information:
1. ID number
2. Diagnosis (M = malignant, B = benign)
4.3.CODE
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Load the data
from google.colab import files # Use to load data on Google Colab
uploaded = files.upload() # Use to load data on Google Colab
df = pd.read_csv('breast cancer dataset 2.csv')
df.head()
#Count the number of rows and columns in the data set
df.shape
#Count the empty (NaN, NAN, na) values in each column
df.isna().sum()
#Drop the column with all missing values (this removes the empty 'Unnamed' column)
df = df.dropna(axis=1)
#Get the new count of the number of rows and cols
df.shape
#Get a count of the number of Malignant (M) (harmful) or Benign (B) cells (not harmful)
df['diagnosis'].value_counts()
#Visualize this count
sns.countplot(df['diagnosis'], label="Count")
#Look at the data types to see which columns need to be transformed / encoded to a number
df.dtypes
#Transform/ Encode the column diagnosis
#dictionary = {'M':1, 'B':0} #Create a dictionary file
#df.diagnosis = [dictionary[item] for item in df.diagnosis] #Change all 'M' to 1 and all 'B' to 0 in the diagnosis col
#Encoding categorical data values (Transforming categorical data/ Strings to integers)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
df.iloc[:,1]= labelencoder_Y.fit_transform(df.iloc[:,1].values)
print(labelencoder_Y.fit_transform(df.iloc[:,1].values))
#A "pairs plot" is a scatterplot matrix in which each variable is plotted against every other variable's values
sns.pairplot(df, hue="diagnosis")
#sns.pairplot(df.iloc[:,1:6], hue="diagnosis") #plot a sample of the columns
#Get the correlation of the columns
df.corr()
#Visualize the correlation
#NOTE: To see the numbers within the cell ==> sns.heatmap(df.corr(), annot=True)
plt.figure(figsize=(20,20)) #This is used to change the size of the figure/ heatmap
sns.heatmap(df.corr(), annot=True, fmt='.0%')
#plt.figure(figsize=(10,10)) #This is used to change the size of the figure/ heatmap
#sns.heatmap(df.iloc[:,1:12].corr(), annot=True, fmt='.0%') #Get a heat map of 11 columns, index 1-11; note index 0 is just the id column and is left out
#Split the data into independent 'X' and dependent 'Y' variables
X = df.iloc[:, 2:31].values #Notice I started from index 2 to 31, essentially removing the id column & diagnosis
Y = df.iloc[:, 1].values #Get the target variable 'diagnosis' located at index=1
# Split the dataset into 75% Training set and 25% Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
# Scale the data to bring all features to the same level of magnitude
# This means the data will be within a specific range for example 0 -100 or 0 - 1
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Create a function that trains many Machine Learning models
def models(X_train, Y_train):
    #Using Logistic Regression Algorithm on the Training Set
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, Y_train)

    #Using KNeighborsClassifier method of the neighbors class to use the Nearest Neighbor algorithm
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
    knn.fit(X_train, Y_train)

    #Using SVC method of the svm class to use the Support Vector Machine Algorithm
    from sklearn.svm import SVC
    svc_lin = SVC(kernel = 'linear', random_state = 0)
    svc_lin.fit(X_train, Y_train)

    #Using SVC method of the svm class to use the Kernel SVM Algorithm
    svc_rbf = SVC(kernel = 'rbf', random_state = 0)
    svc_rbf.fit(X_train, Y_train)

    #Using GaussianNB method of the naive_bayes class to use the Naive Bayes Algorithm
    from sklearn.naive_bayes import GaussianNB
    gauss = GaussianNB()
    gauss.fit(X_train, Y_train)

    #Using DecisionTreeClassifier of the tree class to use the Decision Tree Algorithm
    from sklearn.tree import DecisionTreeClassifier
    tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
    tree.fit(X_train, Y_train)

    #Using RandomForestClassifier method of the ensemble class to use the Random Forest Classification algorithm
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    forest.fit(X_train, Y_train)

    #Print model accuracy on the training data
    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
    print('[1]K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train))
    print('[2]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train))
    print('[3]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train))
    print('[4]Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train))
    print('[5]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
    print('[6]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))

    return log, knn, svc_lin, svc_rbf, gauss, tree, forest
#Train all of the models on the training data
model = models(X_train, Y_train)
#Show the confusion matrix and accuracy for all of the models on the test data
#Classification accuracy is the ratio of correct predictions to total predictions made.
from sklearn.metrics import confusion_matrix
for i in range(len(model)):
    cm = confusion_matrix(Y_test, model[i].predict(X_test))
    TN = cm[0][0]
    TP = cm[1][1]
    FN = cm[1][0]
    FP = cm[0][1]
    print(cm)
    print('Model[{}] Testing Accuracy = "{}!"'.format(i, (TP + TN) / (TP + TN + FN + FP)))
    print() #Print a new line
#Show other ways to get the classification accuracy & other metrics
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
for i in range(len(model)):
    print('Model ', i)
    #Check precision, recall, f1-score
    print(classification_report(Y_test, model[i].predict(X_test)))
    #Another way to get the model's accuracy on the test data
    print(accuracy_score(Y_test, model[i].predict(X_test)))
    print() #Print a new line
#Print the prediction of the Random Forest Classifier model
pred = model[6].predict(X_test)
print(pred)
#Print a space
print()
#Print the actual values
print(Y_test)
4.4.SCREENSHOTS
There is a strong correlation between mean radius and mean perimeter, as well as mean
area and mean perimeter
Depending on how long we’ve lived in a particular place and traveled to a location, we probably
have a good understanding of commute times in our area. For example, we’ve traveled to
work/school using some combination of the metro, buses, trains, ubers, taxis, carpools, walking,
biking, etc.
Over time, our observations about transportation have built up a mental dataset and a mental
model that helps us predict what traffic will be like at various times and locations. We probably
use this mental model to help plan our days, predict arrival times, and many other tasks.
Model-based inference
Prediction
Inference is judging what relationship, if any, exists between the data and the output.
Prediction is making guesses about future scenarios based on data and a model constructed from that data.
In this project, we will be talking about a machine learning model called the Support Vector Machine (SVM).
A Support Vector Machine (SVM) is a binary linear classifier whose decision boundary is explicitly constructed to minimize generalization error. It is a very powerful and versatile machine learning model, capable of performing linear or nonlinear classification, regression and even outlier detection.
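This versatility corresponds to separate estimators in scikit-learn; a short sketch (illustrative only, using scikit-learn's bundled breast cancer data as a stand-in) of the classification, regression and outlier-detection variants:
#Illustrative only: the three SVM-based estimators in scikit-learn
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC, SVR, OneClassSVM
X, y = load_breast_cancer(return_X_y=True)
SVC(kernel='rbf').fit(X, y)          #classification
SVR(kernel='rbf').fit(X, X[:, 0])    #regression (predicting the first feature, purely as a demo)
OneClassSVM(nu=0.05).fit(X)          #outlier / novelty detection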
SVM is well suited for classification of complex but small- or medium-sized datasets.
It is important to start building intuition for the SVM with the special case of linearly separable classification.
If classification of observations is “linearly separable”, SVM fits the “decision boundary” that
is defined by the largest margin between the closest points for each class. This is commonly
called the “maximum margin hyperplane (MMH)”.
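In its standard hard-margin form, the MMH is the solution of the following optimization problem (a standard textbook formulation, stated here for reference): minimize (1/2) * ||w||² subject to yi * (w · xi + b) >= 1 for every training point (xi, yi); the resulting margin width is 2 / ||w||.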
If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the scikit-learn documentation on scores and probabilities).
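As a concrete sketch of this point (using scikit-learn's bundled breast cancer data as an assumed stand-in for the project's CSV), probability estimates only become available when the classifier is constructed with probability=True, which triggers the internal cross-validated calibration:
#Illustrative only: probability estimates require probability=True
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
clf = SVC(kernel='rbf', gamma='scale', probability=True).fit(X, y)
print(clf.predict_proba(X[:5]))   #per-class probabilities for the first 5 samples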
Now that we have a better understanding of modeling and the Support Vector Machine (SVM), let's start training our predictive model.
Model Training
From our dataset, let’s create the target and predictor matrix
“y” = the feature we are trying to predict (the output). In this case we are trying to predict whether our “target” is cancerous (malignant) or not (benign), i.e. we are going to use the “target” feature here.
“X” = the predictors, which are the remaining columns (mean radius, mean texture, mean perimeter, mean area, mean smoothness, etc.)
Now that we’ve assigned values to our “X” and “y”, the next step is to import the python library
that will help us split our dataset into training and testing data.
Training data = the subset of our data used to train our model.
Testing data = the subset of our data that the model hasn’t seen before (We will be using
this dataset to test the performance of our model).
Let’s split our data using 80% for training and the remaining 20% for testing.
Now, let’s train our SVM model with our “training” dataset.
Let’s use our trained model to make a prediction using our testing data
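A minimal sketch of these steps (using scikit-learn's bundled breast cancer data as an assumed stand-in for the project's CSV file, so the exact numbers will differ):
#Illustrative only: 80/20 split, SVM training and prediction on the test set
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
svm = SVC(kernel='rbf', gamma='scale').fit(X_train, y_train)
y_pred = svm.predict(X_test)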
The next step is to check the accuracy of our prediction by comparing it to the output we already have (y_test). We are going to use a confusion matrix for this comparison.
Let's create a confusion matrix for our classifier's performance on the test dataset.
As we can see, our model did not do a good job in its predictions. It predicted that 48
healthy patients have cancer. We only achieved 34% accuracy!
Data normalization is a feature scaling process that brings all values into the range [0, 1].
Now, let’s train our SVM model with our scaled (Normalized) datasets.
Our prediction got a lot better, with only 1 false prediction (cancer predicted instead of healthy). We achieved 98% accuracy!
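A corresponding sketch with the normalization step included (same stand-in dataset; MinMaxScaler is one common way to map every feature into the range [0, 1], and the confusion matrix is computed exactly as described above):
#Illustrative only: normalize the features, retrain the SVM and evaluate it
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
scaler = MinMaxScaler().fit(X_train)          #fit the [0, 1] range on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
svm = SVC(kernel='rbf', gamma='scale').fit(X_train_scaled, y_train)
y_pred = svm.predict(X_test_scaled)
print(confusion_matrix(y_test, y_pred))
print('accuracy:', accuracy_score(y_test, y_pred))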
5. SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.
5.1.TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the application, and it is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.
Integration testing
Integration tests are designed to test integrated software components to determine if they actually
run as one program. Testing is event driven and is more concerned with the basic outcome of
screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that arise from the
combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects. The
task of the integration test is to check that components or software applications, e.g. components
in a software system or, one step up, software applications at the company level – interact
without error.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
6. RESULT ANALYSIS
7.1. CONCLUSION:
Our work mainly focused on the advancement of predictive models to achieve good accuracy in predicting valid disease outcomes using supervised machine learning methods. The analysis of the results signifies that integrating multidimensional data with different classification, feature selection and dimensionality reduction techniques can provide promising tools for inference in this domain. Further research in this field should be carried out to improve the performance of the classification techniques so that they can predict using more variables.
7.2. FUTURE SCOPE:
As not only the field of interest but also the results of this study turned out to be rich and broad, this work can be extended in several directions. Some of the possible ways to investigate it in the near future are discussed below. In the Jacobi moment calculation, a 4x4 overlapping window is used for extracting features. As the size of the window increases, the computational cost also increases. The computational cost can be reduced by using a non-overlapping window of varying size. However, if the normal and abnormal regions fall within a single window of small size, the performance rate will be reduced, whereas it may be increased by using a non-overlapping window of varying size. In future work, by carefully choosing the window size, the number of features may be reduced to achieve the same or a better classification rate with reduced computational cost. In the second proposed system, a classification system based on a combined feature set, the Haar discrete wavelet transform is used. Recently, a number of multi-resolution analyses have been introduced by researchers, such as the Contourlet transform, Curvelet transform, Ridgelet transform and so on. To improve the classification and performance of the mammogram classification system, the features of normal and abnormal regions must be distinct. Instead of using the wavelet transform to extract the features in this method, other multi-resolution analyses may be used in the future. Furthermore, a clinically useful system for breast cancer detection must be able to detect the breast cancers themselves, not just micro-calcifications. In the future, the combined feature set method can be extended to the detection and classification of masses.