
A PROJECT REPORT

ON
DIAGNOSIS OF BREAST CANCER DISEASE USING SUPPORT VECTOR MACHINE
Submitted in partial fulfilment of the requirements for the award of the degree of
MASTER OF SCIENCE (COMPUTER SCIENCE)

SUBMITTED BY

S.VARALAKSHMI
Regd No: 18MCS024

UNDER THE GUIDANCE OF

Mr. M.V. PAVAN KUMAR, MCA, M.Tech, (Ph.D.)


Assistant Professor, Dept. of MCA

DEPARTMENT OF COMPUTER SCIENCE

ISO 9001:2015 Certified, Re-accredited at ‘A’ by NAAC

KAKARAPARTI BHAVANARAYANA COLLEGE (AUTONOMOUS)


(Approved by AICTE, Affiliated to KRISHNA UNIVERSITY, MACHILIPATNAM)
Kothapet, Vijayawada, Krishna (Dt.), Pincode-520001
2018-2020
ISO 9001:2015 Certified, Re-accredited at ‘A’ by NAAC
KAKARAPARTI BHAVANARAYANA PG COLLEGE (AUTONOMOUS)
(Approved by AICTE, Affiliated to KRISHNA UNIVERSITY, MACHILIPATNAM)
Kothapet, Vijayawada, Krishna (Dt.), Pincode-520001

DEPARTMENT OF COMPUTER SCIENCE

CERTIFICATE
This is to certify that the work entitled “DIAGNOSIS OF BREAST CANCER
DISEASE USING SUPPORT VECTOR MACHINE” is a bonafide work carried out by
S. VARALAKSHMI (7724) in partial fulfilment of the requirements for the award of the
degree of MASTER OF SCIENCE (COMPUTER SCIENCE) of KRISHNA UNIVERSITY,
MACHILIPATNAM during the academic year 2018-2020. It is certified that the
corrections/suggestions indicated for internal assessment have been incorporated in the
report. The project work has been approved, as it satisfies the academic requirements in
respect of the project work prescribed for the above degree.

Project Guide Head of the Department

External Examiner
ACKNOWLEDGMENT

The satisfaction that accompanies the successful completion of any task would be incomplete
without mentioning the people who made it possible and whose constant guidance and
encouragement crown all efforts with success. This acknowledgement transcends the reality
of formality; I would like to express deep gratitude and respect to all those people behind
the screen who guided, inspired and helped me in the completion of this work. I wish to place
on record my deep sense of gratitude to my project guide, Mr. PAVAN KUMAR, Assistant
Professor, Department of MCA, for his constant motivation and valuable help throughout the
project work.

My sincere thanks to Mrs. SHAMIM, Head of the Department of M.Sc. (CS), for her guidance
regarding the project. I also extend my thanks to Dr. P. BHARATHI DEVI, Head of the
Department of MCA, for her valuable help throughout the project, and to
Dr. MAZHARUNNISA BEGUM of the P.G. Centre. I extend my gratitude to SRI
S. VENKATESH, Director for P.G. Courses, for his valuable suggestions.

S.VARALAKSHMI

Regd.NO:18MCS024
DECLARATION

I hereby declare that the project work entitled “DIAGNOSIS OF BREAST CANCER DISEASE
USING SUPPORT VECTOR MACHINE”, submitted to K.B.N P.G COLLEGE, affiliated to
KRISHNA UNIVERSITY, has been done under the guidance of Mr. PAVAN KUMAR,
Assistant Professor, Department of MCA, during the period of my study, and that it has not
formed the basis for the award of any degree/diploma or other similar title to any candidate of
any university.

Signature of the student


Name : S. VARALAKSHMI
Regd No : 7724
College Name : K.B.N PG COLLEGE

DATE:
PLACE: VIJAYAWADA
ABSTRACT

Breast cancer is an all-too-common disease in women, making its effective prediction an
active research problem. A number of statistical and machine learning techniques have been
employed to develop various breast cancer prediction models. Among them, support vector
machines (SVM) have been shown to outperform many related techniques. To construct the
SVM classifier, it is first necessary to choose the kernel function, and different kernel functions
can result in different prediction performance. However, there have been very few studies
focused on examining the prediction performance of SVM with different kernel functions.
Moreover, it is unknown whether SVM classifier ensembles, which have been proposed to
improve the performance of single classifiers, can outperform single SVM classifiers in terms of
breast cancer prediction. Therefore, the aim of this project is to fully assess the prediction
performance of SVM and SVM ensembles over small- and large-scale breast cancer datasets. The
classification accuracy, ROC, F-measure, and training times of SVM and SVM ensembles are
compared.

Breast cancer prediction has long been regarded as an important research problem in the medical
and healthcare communities. This cancer develops in the breast tissue. There are several risk
factors for breast cancer including female sex, obesity, lack of physical exercise, drinking
alcohol, hormone replacement therapy during menopause, ionizing radiation, early age at first
menstruation, having children late or not at all, and older age.

There are different types of breast cancer, with different stages of spread, aggressiveness, and
genetic makeup. Therefore, it would be very useful to have a system that allows early detection
and prevention, which would increase the survival rates for breast cancer.

The literature discusses a number of different statistical and machine learning techniques that
have been applied to develop breast cancer prediction models, such as logistic regression, linear
discriminant analysis, naïve Bayes, decision trees, artificial neural networks, k-nearest neighbor,
and support vector machine methods. More specifically, studies comparing some of the
above-mentioned techniques have shown that SVM performs better than many of the other
related techniques.
INDEX

1. INTRODUCTION
   1.1 BACKGROUND
   1.2 PROBLEM STATEMENT
   1.3 PROPOSED SYSTEM

2. REQUIREMENT ANALYSIS
   2.1 HARDWARE REQUIREMENTS
   2.2 SOFTWARE REQUIREMENTS
   2.3 INTRODUCTION TO SOFTWARE ENVIRONMENT
   2.4 FEASIBILITY STUDY

3. REVIEW OF LITERATURE
   3.1 REVIEW OF LITERATURE

4. DESIGN AND IMPLEMENTATION OF ALGORITHM
   4.1 SVM ALGORITHM
   4.2 BREAST CANCER DETECTION THROUGH SVM
   4.3 CODE
   4.4 SCREENSHOTS

5. SYSTEM TESTING
   5.1 TYPES OF TESTS (Unit testing, Integration testing, Functional test)
   5.2 SYSTEM TEST (White Box Testing, Black Box Testing)

6. RESULT ANALYSIS

7. CONCLUSION AND FUTURE SCOPE

8. REFERENCES
1. INTRODUCTION
The analysis of gene expressions has been an important topic in molecular biology. Gene
expression is the process by which a functional gene product (e.g. a protein) is synthesized using
information encoded in a gene. Microarrays are high-throughput, high-sensitivity tools that
allow the measurement of thousands of genes simultaneously. The conduct of microarray
experiments is expensive; therefore, measurements often exist only for very few samples. This
leads to data analysis problems in very high-dimensional feature spaces with relatively few
samples, which are an interesting challenge and have been the subject of multiple machine
learning approaches.

The CORUM database contains manually annotated protein complexes from mammalian
organisms and is maintained by the Munich Information Center for Protein Sequences (MIPS) at
the Helmholtz Zentrum Munich. Its annotations for protein complexes include protein-complex
function, localization and subunit composition, as well as literature references and more.

We propose a new data integration method that uses the protein-complex information from the
CORUM database to compute virtual protein-complex expressions. While the study of
multi-class problems is less frequent, we believe that support vector machines with recursive
feature elimination (SVM-RFE), in combination with our new data integration method, offer a
potent solution.

To verify our approach, we have trained a multi-class support vector machine with recursive
feature elimination on a large breast-cancer dataset and have determined its performance on a
smaller breast-cancer dataset from a different study. Even though we were unable to increase
classification performance on the computed virtual protein-complex expressions over gene
expressions, our novel data integration method coupled with multi-class SVM-RFE has extracted
biologically interesting and interpretable protein complexes. Therefore, our new approach can be
seen as an interesting method to analyze and interpret datasets of this kind. Furthermore, it can
lead to new insights into the role and function of protein complexes by using virtual
protein-complex expressions from gene expression microarray datasets.

1.1. BACKGROUND
The first chapter will be a short introduction to the relevant background. It will start with
molecular biology, continue with Support Vector Machines (SVMs) and possible derived
approaches to solve multi-class problems, and end with a brief introduction to breast cancer.

THE CENTRAL DOGMA OF MOLECULAR BIOLOGY:

Figure: Diagram of the information flow between DNA, RNA and protein according to the central
dogma of molecular biology. Black arrows denote flows of the general group and red arrows
denote flows of the special group.

In his 1970 article “Central dogma of molecular biology”, Crick wrote:

“The central dogma of molecular biology deals with the detailed residue-by-residue transfer of
sequential information. It states that such information cannot be transferred from protein to either
protein or nucleic acid”, and elaborates this idea throughout the article. The central dogma of
molecular biology deals with information transfer, which can be split into three groups:
general, special and unknown. The general group, which occurs in all cells with minor
exceptions, consists of the following three transfers:

• deoxyribonucleic acid (DNA) → DNA,

• DNA → ribonucleic acid (RNA),

• RNA → Protein.

The special group does not occur in most cells, but has been shown to be present in
virus-infected cells, with flows RNA → RNA and RNA → DNA.

A third case for the special group, the DNA → Protein flow, has only been seen in cell-free
systems containing neomycin. The other information flows are postulated by the central dogma
to never occur; these are Protein → Protein, Protein → DNA and Protein → RNA, and they
belong to the unknown group (Crick, 1970).

Genes:
We divide all cells into two basic types: prokaryotes and eukaryotes. The defining structure that
sets eukaryotic cells apart from prokaryotic cells is the nucleus. Prokaryotic cells have cell walls
containing glycopeptides, whereas eukaryotic organisms are made from cells that are organized
by complex structures within membranes (Okafur, 2007, p.17). Human beings are eukaryotes
and contain billions of individual cells.

Almost all of these cells contain, within each nucleus, the complete hereditary information for
the organism in the form of the genome. The genome consists of DNA and can be seen as the
blueprint for all structures of the organism. The human genome is made from 23 pairs of
chromosomes, where each pair is based on the chromosome pairs from the biological parents.
The chromosomes contain chains of DNA, which consist of two polymers. These polymers are
wrapped around each other and form a structure known as the double helix. The polymer strands
are held together by hydrogen bonds. They are large molecules of repeating monomers, which
are called nucleotides. Each nucleotide is made from deoxyribose sugar, a phosphate group and
one of the following four nitrogen bases: adenine, cytosine, guanine and thymine, usually
represented by their first letters A, C, G and T. Due to a property of the nitrogen bases (called
complementary base pairing) one can deduce from one strand of DNA the other, complementary
strand; in particular, adenine can only form hydrogen bonds with thymine, and guanine with
cytosine. The sequence of these nucleotides in the double helix encodes the hereditary genetic
information.

Genes are, as Pearson puts it in 2006:

“A locatable region of genomic sequence, corresponding to a unit of inheritance, which is
associated with regulatory regions, transcribed regions and/or other functional sequence
regions”,

and can therefore be described by their ordered sequence of nitrogen bases. The length of these
sequences can be hundreds of thousands of bases. The sequences encode particular patterns, of
which the exact number in the human genome is unknown. The number of protein-coding genes
is estimated at between 20,000 and 25,000. To have the hereditary information available within
all cells, the DNA has to reside in each cell. Therefore, the DNA needs to be replicated to create
new cells, which is essential to multi-cell organisms. The so-called DNA replication is the first of
the general information flows described by the central dogma of molecular biology. To form a
protein from a protein-coding gene, the gene information has to flow from the gene to the
messenger RNA (mRNA) using a process called transcription, and from the mRNA to the final
protein through the so-called translation. These are the second and third information flows as
postulated by the central dogma.

DNA replication:

Because the complete DNA is contained within each cell of the organism, it needs to be
replicated upon cell division (cytokinesis). In eukaryotes, the replication is triggered at the end of
the interphase. The interphase is followed by the separation of the chromosomes (mitosis), and
the immediate cytokinesis. During the DNA replication process the double helix is split into its
strands and each strand’s complementary pair is synthesized using an enzyme called DNA
polymerase.

Transcription:

Pearson (2006) noted that genes consist of regions that are regulatory as well as regions that
explicitly code for a protein. One of these regulatory regions is known as the promoter, and is
used by the RNA polymerase that drives the RNA synthesis. The RNA synthesis is similar to the
DNA synthesis, with the notable difference that only one strand is copied. Eukaryotic genes
consist of exons and introns, which are DNA regions within a gene.

The difference between exons and introns is that the final mRNA represents only the exons.
During transcription, a precursor mRNA (pre-mRNA) is first transcribed from the DNA strand.
The reverse way is used in biotechnology as well as by retroviruses, including the HIV/AIDS
virus, where a single-stranded RNA is transcribed to a single-stranded DNA (Okafur, 2007, p.36).

A process called splicing later removes the introns from the pre-mRNA to yield the final
single-stranded mRNA, which consists only of exons. The mRNA leaves the nucleus via the
nuclear membrane.

Translation:
Outside of the nucleus the mRNA is used as a template for the synthesis of proteins, which is
called translation. The mRNA contains the nucleotide uracil (U) instead of thymine. The
translation is done by ribosomes, which are large complexes of proteins. These ribosomes read
the genetic information carried by the mRNA molecules in triplets of nucleotides, and combine
any of the 20 amino acids in the human body into complex polypeptide chains through chemical
reactions. These triplets are called codons. The translation will start on the codon AUG, and will
select phenylalanine if it reads the codon UUU, or glycine on GGG (these are examples; a
complete list can be found in Okafur, 2007, p.38, Table 3.1). If it reads the codon UAA, UAG or
UGA, the translation will be terminated (Okafur, 2007, p.38). The resulting polypeptide chains
then form the protein.

The process, in which the information from a gene is used to synthesize a functional gene
product or protein via transcription and translation, is known as gene expression.

Proteins:

Proteins are macromolecules; they form the building blocks of the organism and are responsible
for numerous functions inside the living organism. Proteins are important for the metabolism,
which is the maintaining of life in living things through chemical reactions. As hormones,
proteins transport messages through the body, and as transport proteins they carry substances;
for example, the protein hemoglobin transports oxygen. Proteins are also the basis for enzymes,
catalysts for chemical reactions.
Proteins form 3-dimensional structures by folding their amino acid backbone. The protein’s
shape depends on the process it guides (Coen, 1999).

The linear chain of amino acids resulting from the translation phase is a so-called random coil, a
simple polypeptide. To form a proper protein the random coil needs to fold into a well-defined
three-dimensional structure, which defines its characteristics and function. This process is driven
by the interaction of the amino acids of the random coil during and after the protein synthesis.
The correct fold is essential for proteins to function correctly, and failure results in inactive
proteins or toxins. Inside the human body, there are many different types of proteins. When
several proteins with different polypeptide chains form a complex in which each polypeptide
chain contains different protein domains, the result can have multiple catalytic functions, and is
called a multi-protein complex or protein complex.

Multiprotein Complexes:

Today, we know that the cell’s dry mass consists mostly of proteins. These protein molecules
form protein assemblies that carry out most of the major processes in the cell. These little
machines of ten or more protein molecules perform complex biological functions, where each
assembly interacts with other protein complexes (assemblies). These functions include the cell
cycle, protein degradation and protein folding (Alberts, 1998).

A similar definition is given by Ruepp et al. (2007, 2010) in CORUM: the Comprehensive
Resource of Mammalian Protein Complexes with a slightly stronger emphasis on gene
dependence:

“Protein complexes are key molecular entities that integrate multiple gene products to perform
cellular functions.”

The field of proteomics, the large-scale study of proteins, can be divided into cell-map and
expression proteomics. The large-scale, quantitative study of protein-protein interactions through
the isolation of protein complexes is called cell-map proteomics; it studies in particular the
structure and function of the proteins contained within the protein complexes. The study of
protein expression changes is called expression proteomics. The availability of complete
sequences of the genome has shifted the focus towards functional interpretation of genomics
(Blackstock and Weir, 1999). This has led to the creation of large-scale databases containing
protein complexes, subunits and functional descriptions.

One of the largest freely accessible databases is the CORUM database, maintained as part of the
Munich Information Center for Protein Sequences (MIPS). It is the comprehensive resource of
mammalian protein complexes, and contains mainly human (64%), mouse (16%) and rat (12%)
protein complexes, which have been experimentally verified. In 2007, the database contained
more than 1,750 protein complexes composed of 2,400 different genes, representing ∼12% of
the protein-coding genes in humans. It had grown to more than 2,850 protein complexes from
3,198 different genes by 2009, with the release of CORUM 2.0, and now represents ∼16% of the
protein-coding genes in humans (Ruepp et al., 2007, 2010).

1.2. PROBLEM STATEMENT

This project investigates the diagnosis of breast cancer using SVM techniques.

The benchmarks used in this study are accuracy, ROC, F-measure and the computational time
of training. The results confirm the superior performance of the SVM algorithm over other
classifiers.

1.3. PROPOSED SYSTEM

In the present study, a method consisting of four main phases is proposed for the diagnosis of
breast cancer, using SVM for feature extraction.

2. REQUIREMENT ANALYSIS

2.1. HARDWARE REQUIREMENTS

 Processor – Core-series processor
 RAM – 8 GB
 Hard disk – 2 TB

2.2. SOFTWARE REQUIREMENTS

 Windows 10
 Python

2.3. INTRODUCTION TO SOFTWARE ENVIRONMENT

The abundant amount of information resulting from microarray experiments requires
algorithms that can deal with this high volume of data. The mathematical field of machine
learning studies the classification or recognition of patterns in vast amounts of data. “For
instance, cancer type classification based on microarray expression, or determining whether a
protein binds to DNA or not based on sequence and structural motifs, are good examples of
classification problems.” Classification is a subtopic of the supervised learning category, as
outlined in chapter 1 by Hastie et al. (2009): “The learning problems that we consider can be
roughly categorized as either supervised or unsupervised. In supervised learning, the goal is
to predict the value of an outcome measure based on a number of input measures; in
unsupervised learning, there is no outcome measure, and the goal is to describe the
associations and patterns among a set of input measures.”

Different methods have been applied to classification tasks in bioinformatics. Widely used
are Decision Trees, Support Vector Machines, Bayesian Classifiers, Neural Networks and
many more (Mathura and Kangueane, 2009, ch. 3). The following will introduce Support
Vector Machines, which will be used extensively throughout the following chapters.

2.4. FEASIBILITY STUDY


Python is an interpreted, high-level, general-purpose programming language. Created by Guido
van Rossum and first released in 1991, Python's design philosophy emphasizes code
readability with its notable use of significant whitespace. Its language constructs and object-
oriented approach aim to help programmers write clear, logical code for small and large-scale
projects.

Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including procedural, object-oriented, and functional programming. Python is often
described as a "batteries included" language due to its comprehensive standard library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released
in 2000, introduced features like list comprehensions and a garbage collection system capable of
collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language
that is not completely backward-compatible, and much Python 2 code does not run unmodified
on Python 3.

The Python 2 language, i.e. Python 2.7.x, was officially discontinued on 1 January 2020 (the
date was first planned for 2015), after which security patches and other improvements are no
longer released for it. With Python 2's end-of-life, only Python 3.5.x and later are supported.

Python interpreters are available for many operating systems. A global community of
programmers develops and maintains CPython, an open source reference implementation.
A non-profit organization, the Python Software Foundation, manages and directs resources for
Python and CPython development.

Python is an easy to learn, powerful programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant
syntax and dynamic typing, together with its interpreted nature, make it an ideal language for
scripting and rapid application development in many areas on most platforms.

The Python interpreter is easily extended with new functions and data types implemented in C or
C++ (or other languages callable from C). Python is also suitable as an extension language for
customizable applications.

HISTORY:

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired
by SETL), capable of exception handling and interfacing with the Amoeba operating system. Its
implementation began in December 1989. Van Rossum shouldered sole responsibility for the
project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation"
from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community
bestowed upon him to reflect his long-term commitment as the project's chief decision-maker.
In January 2019, active Python core developers elected Brett Cannon, Nick Coghlan, Barry
Warsaw, Carol Willing and Van Rossum to a five-member "Steering Council" to lead the
project, and he now shares leadership as a member of that council.

Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-
detecting garbage collector and support for Unicode.

Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not
completely backward-compatible. Many of its major features were backported to the Python
2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at
least partially) the translation of Python 2 code to Python 3.

Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that
a large body of existing code could not easily be forward-ported to Python 3.

Python Libraries

After modules and Python packages, we shift our discussion to Python libraries. In this section,
we will discuss the Python Standard Library and different libraries offered by the Python
programming language: pandas, Matplotlib, SciPy, NumPy, etc.

What are Python Libraries?

We know that a module is a file with some Python code, and a package is a directory for
sub-packages and modules. But the line between a package and a Python library is quite blurred.
A Python library is a reusable chunk of code that you may want to include in your programs/
projects. Compared to languages like C++ or C, Python libraries do not pertain to any specific
context in Python. Here, a ‘library’ loosely describes a collection of core modules. Essentially,
then, a library is a collection of modules. A package is a library that can be installed using a
package manager such as pip.

Python Standard Library

The Python Standard Library is the collection of modules, with their exact syntax, tokens and
semantics, that comes bundled with the core Python distribution. We mentioned this when we
began with an introduction.

Much of it is written in C, and it handles functionality like I/O and other core tasks. All this
functionality together makes Python the language it is. More than 200 core modules sit at the
heart of the standard library. This library ships with Python. But in addition to this library, you
can also access a growing collection of several thousand components from the Python Package
Index (PyPI).

Important Python Libraries

Matplotlib

Matplotlib is a numerical plotting library that helps with data analysis. We talked about it
in Python for Data Science.

Matplotlib is a widely used Python-based library; it is used to create 2D plots and graphs easily
through Python scripts, most commonly via its pyplot interface. Using pyplot, we can create
plots easily and control font properties, line styles, axis formatting, etc. It also provides a massive
variety of plots and graphs such as bar charts, histograms, power spectra, error charts, etc.
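
As a minimal illustration of the pyplot usage described above (the data values here are invented
purely for the example):

# A minimal pyplot sketch; the data is made up for illustration.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.plot(x, y, color="green", linestyle="--", label="y = x^2")  # basic line controls
plt.xlabel("x")                                                 # axis formatting
plt.ylabel("y")
plt.title("Simple pyplot example")
plt.legend()
plt.show()                                                      # display the figure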

Pandas

Pandas is an open-source Python library providing high-performance data manipulation and
analysis tools built on its powerful data structures. The name Pandas is derived from “panel
data”, an econometrics term for multidimensional data.

In 2008, developer Wes McKinney started developing pandas when he needed a
high-performance, flexible tool for the analysis of data.

Prior to Pandas, Python was mostly used for data munging and preparation, and contributed
very little to data analysis. Pandas solved this problem. Using Pandas, we can accomplish five
typical steps in the processing and analysis of data, regardless of the origin of the data: load,
prepare, manipulate, model, and analyze.

Python with Pandas is used in a wide range of academic and commercial domains, including
finance, economics, statistics, analytics, etc.

Key Features of Pandas

 Fast and efficient DataFrame object with default and customized indexing.

 Tools for loading data into in-memory data objects from different file formats.

 Data alignment and integrated handling of missing data.

 Reshaping and pivoting of data sets.

 Label-based slicing, indexing and subsetting of large data sets.

 Columns from a data structure can be deleted or inserted.

 Group by data for aggregation and transformations.

 High performance merging and joining of data.

 Time Series functionality.
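
As a hedged sketch of the load, prepare, manipulate and analyze steps listed above; the file name
breast_cancer.csv and its columns are hypothetical stand-ins, not the dataset used later in this
report:

# Sketch of typical Pandas steps; 'breast_cancer.csv' and its columns
# ('diagnosis', 'radius', 'area', 'perimeter') are hypothetical examples.
import pandas as pd

df = pd.read_csv("breast_cancer.csv")             # load: read a CSV into a DataFrame
df = df.dropna()                                  # prepare: drop rows with missing data
df["area_ratio"] = df["area"] / df["perimeter"]   # manipulate: derive a new column
print(df.groupby("diagnosis")["radius"].mean())   # analyze: group-by aggregation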

Numpy

Python is increasingly being used as a scientific language. Matrix and vector manipulations are
extremely important for scientific computations. Both NumPy and Pandas have emerged as
essential libraries for any scientific computation, including machine learning, in Python, due to
their intuitive syntax and high-performance matrix computation capabilities.

NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open-source Python module
which provides fast mathematical computation on arrays and matrices. Since arrays and
matrices are an essential part of the machine learning ecosystem, NumPy, along with machine
learning modules like Scikit-learn, Pandas, Matplotlib, TensorFlow, etc., completes the Python
machine learning ecosystem.

NumPy provides the essential multi-dimensional array-oriented computing functionality
designed for high-level mathematical functions and scientific computation. NumPy can be
imported into a script or notebook using:

>>> import numpy as np

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements of the
same type, i.e. all integers or all strings or all characters (homogeneous), usually integers. In
NumPy, dimensions are called axes. The number of axes is called the rank.

There are several ways to create an array in NumPy, such as np.array, np.zeros and np.ones.
Each of them provides some flexibility.
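
A minimal sketch of the array-creation functions and the axes/rank terminology described above:

# NumPy array creation and basic attributes.
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # homogeneous 2-D array of integers
z = np.zeros((2, 3))                   # 2x3 array filled with 0.0
o = np.ones(5)                         # 1-D array of five 1.0 values

print(a.ndim)    # number of axes (the rank): 2
print(a.shape)   # size along each axis: (2, 3)
print(a.dtype)   # common element type of the array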

3. REVIEW OF LITERATURE

3.1. REVIEW OF THE LITERATURE

EPIDEMIOLOGY

Breast cancer is the most commonly occurring cancer in women, comprising almost one third of
all malignancies in females. It is second only to lung cancer as a cause of cancer mortality, and it
is the leading cause of death for American women between the ages of 40 and 55.[1] The lifetime
risk of a woman developing invasive breast cancer is 12.6%;[2] one out of 8 females in the
United States will develop breast cancer at some point in her life. The death rate for breast cancer
has been slowly declining over the past decade, and the incidence has remained level since 1988
after increasing steadily for nearly 50 years.[3] Twenty-five percent to 30% of women with
invasive breast cancer will die of their disease.[1] But this statistic, as grim as it is, also means
that 70% to 75% of women with invasive breast cancer will die of something other than their
breast cancer. Hence, a diagnosis of breast cancer, even invasive breast cancer, is not necessarily
the “sentence of death” that many women (and their insurance companies) imagine. Mortality
rates are highest in the very young (less than age 35) and in the very old (greater than age 75).[4]
It appears that the very young have more aggressive disease, and that the very old may not be
treated aggressively or may have comorbid disease that increases breast cancer fatality.[5]
Although 60% to 80% of recurrences occur in the first 3 years, the chance of recurrence exists
for up to 20 years.

PATHOLOGY OF BREAST CANCER

Ninety-five percent of breast cancers are carcinomas, i.e. they arise from breast epithelial
elements. Breast cancers are divided into 2 major types: in situ carcinomas and invasive (or
infiltrating) carcinomas. The in situ carcinomas may arise in either ductal or lobular epithelium,
but remain confined there, with no invasion of the underlying basement membrane that would
constitute extension beyond epithelial boundaries. As would be expected with such localized and
confined malignancy, there is negligible potential for metastases. When there is extension of the
ductal or lobular malignancy beyond the basement membrane that constitutes the epithelial
border, the malignancy is considered invasive (or infiltrating) ductal or lobular carcinoma.

The potential for metastases and ultimately death occurs in invasive disease.

RISK FACTORS FOR DEVELOPMENT OF BREAST CANCER

Breast cancer incidence is highest in North America and Northern Europe and lowest in Asia and
Africa. Studies of migration patterns to the United States suggest that genetic factors alone do
not account for the incidence variation among countries, as the incidence rates of second-, third-
and fourth-generation Asian immigrants increase steadily in this country. Thus, environmental
and/or lifestyle factors appear to be important determinants of breast cancer risk.[5] Gender is by
far the greatest risk factor: breast cancer occurs 100 times more frequently in women than in
men. In women, incidence rates of breast cancer rise sharply with age (see Table 1) until ages 45
to 50, when the rise becomes less steep.[4] This change in slope probably reflects the impact of
the hormonal change (menopause) that occurs about this time. By ages 75 to 80, the curve
actually flattens and then decreases. Despite the steepness of the incidence curve at younger
ages, the more important issue is the increasing prevalence of breast cancer with advancing age,
and the take-home message for physicians and underwriters alike is that any breast mass in a
postmenopausal woman should be considered cancer until proven otherwise.[8] Genetics plays a
limited but important role as a risk factor for breast cancer.

Only 5% to 6% of breast cancers are considered hereditary.[9] BRCA-1 and BRCA-2 account for
an estimated 80% of hereditary breast cancer, but again, this only represents 5% to 6% of all
breast cancers. BRCA-1 and/or BRCA-2 positive women have a 50% to 85% lifetime risk of
developing breast cancer (see Table 1), and a 15% to 65% risk of developing ovarian cancer,
beginning at age 25.[10]

Familial breast cancer is considered a risk if a first-degree relative developed breast cancer
before menopause, if it affected both breasts, or if it occurred in conjunction with ovarian
cancer.[11] There is a 2-fold relative risk of breast cancer if a woman has a single first-degree
relative (mother, sister or daughter) with breast cancer, and a 5-fold increased risk if 2
first-degree relatives have had breast cancer.[12] A woman’s hormonal history appears to be a
risk factor, as the relative risk of breast cancer seems to be related to the breast’s cumulative
exposure to estrogen and progesterone.

Early menarche (onset of menstruation before age 13), having no children or having them after
age 30, and menopause after age 50 (especially after age 55) all mean more menstrual cycles
and thus greater hormone exposure.[13] The Women’s Health Initiative (WHI), a randomized
controlled trial of 16,608 postmenopausal women comparing the effects of estrogen plus
progestin with placebo on chronic disease risk, confirmed that combined estrogen plus progestin
use increases the risk of invasive breast cancer.[14] Hormone replacement therapy (HRT) users
have a breast cancer risk that is 53% higher for combination therapy and 34% higher for estrogen
alone, especially if used for more than 5 years.

Although earlier studies suggested that this increased risk of cancer was offset by the fact that the
cancers induced by HRT were of more benign pathology and had a more favorable prognosis,[4]
reevaluation of the WHI data reveals this impression to be incorrect. Invasive breast cancers
associated with estrogen plus progestin use were larger (1.7 cm vs 1.5 cm, p = 0.04), were more
likely to be node positive (26% vs 16%, p = 0.03), and were diagnosed at a significantly more
advanced stage (regional/metastatic 25.4% vs 16%, p = 0.04). The percentages and distribution
of invasive ductal, invasive lobular, mixed ductal and lobular, as well as tubular carcinomas were
similar in the estrogen plus progestin group vs the placebo group.[15] Over observation times as
short as a year, there was a statistically significant increase in breast density in the estrogen plus
progestin group, resulting in an increased incidence of abnormal mammograms (9.4% vs 5.4%,
p < 0.001).

As noted by Gann and Morrow in a JAMA editorial, “the ability of combined hormone therapy
to decrease mammographic sensitivity creates an almost unique situation in which an agent
increases the risk of developing a disease while simultaneously delaying its detection.”[16]
Li et al reported that women using unopposed estrogen replacement therapy (ERT) had no
appreciable increase in the risk of breast cancer. However, use of combined estrogen and
progestin hormone replacement therapy carried an overall 1.7-fold (95% CI 1.3–2.2) increased
risk of breast cancer, including a 2.7-fold (95% CI 1.7–4.3) increased risk of invasive lobular
carcinoma, a 1.5-fold (95% CI 1.1–2.0) increased risk of invasive ductal carcinoma, and a 2-fold
(95% CI 1.5–2.7) increased risk of ER+/PR+ breast cancers.[17] Other risk factors for breast
cancer include alcohol, which has been linked to increased blood levels of estrogen and to
interference with the folate metabolism that protects against tumor growth. Women who drink
more than 2 ounces of alcohol per day are 40% more likely to develop breast cancer than women
who drink no alcohol.

RELATIONSHIP OF BENIGN BREAST DISEASE WITH BREAST CANCER

This is an issue of great concern for patients, physicians and insurance companies alike, as there
are conditions that confer no risk of malignancy and others that definitely confer increased risk.
Breast biopsy findings conferring no significantly increased risk for malignancy include any
lesion with non-proliferative change.[25,26] These include duct ectasia and simple
fibroadenomas, benign solid tumors containing glandular as well as fibrous tissue; the latter are
usually single but may be multiple. Solitary papillomas are also benign lesions conferring no
increased risk of future malignancy, despite the fact that they are often associated (in 21 of 24
women in a single study[27]) with sanguineous or serosanguineous nipple discharge. Fibrocystic
change (cysts and/or fibrous tissue without symptoms) or fibrocystic disease (fibrocystic changes
occurring in conjunction with pain, nipple discharge, or a degree of lumpiness sufficient to cause
suspicion of cancer) does not carry an increased risk for cancer (other than the potential for
missing a malignant mass).

Some clinicians differentiate fibrocystic change or disease into hyperplasia, adenosis, and cystic
change because these differ in their age distributions. Hyperplasia characteristically occurs in
women in their 20s, often with upper outer quadrant breast pain and an indurated axillary tail, as
a result of stromal proliferation. Women in their 30s present with solitary or multiple breast
nodules 2–10 mm in size, as a result of proliferation of glandular cells. Women in their 30s and
40s present with solitary or multiple cysts. Acute enlargement of cysts may cause pain, and
because breast ducts are usually patent, nipple discharge is common, with the discharge varying
in color from pale green to brown.[29] Conditions with increased risk of malignancy include
ductal hyperplasia without atypia.

This is the most commonly encountered breast biopsy result that is definitely associated with an
increased risk of future development of breast cancer, and it confers a 2-fold increased risk.

The number, size and shape of the epithelial cells lining the basement membrane of ducts are
increased, but the histology does not fulfill the criteria for malignancy. The loss of expression of
transforming growth factor-β receptor II in the affected epithelial cells is associated with an
increased risk of invasive breast cancer.[30] A number of other benign lesions also confer a
roughly 2-fold increased risk for the development of breast cancer. These include sclerosing
adenosis, where lobular tissue undergoes hyperplastic change with increased fibrous tissue and
interspersed glandular cells; diffuse papillomatosis, which is the formation of multiple
papillomas; and fibroadenomas with proliferative disease, which are tumors that contain cysts
greater than 3 mm in diameter, with sclerosing adenosis, epithelial calcification, or papillary
apocrine change.

Radial scars are benign breast lesions of uncertain pathogenesis, which are usually discovered
incidentally when a breast mass is removed for other reasons. Radial scars are characterized by a
fibroelastic core from which ducts and lobules radiate.[31]

DETECTION OF BREAST CANCER

As breast cancer rarely causes pain, a painless mass is much more worrisome for malignancy
than is one causing symptoms. Mammography done yearly beginning at age 40 is the current
recommendation for women with no risk factors.[35] The most commonly encountered
categorization of mammography findings is summarized in Table 2.

Although mammograms may detect malignancies as small as 0.5 cm, 10% to 20% of
malignancies elude detection by mammography, even when they occur at a much larger size.[36]
In a patient with a solid, dominant mass (suspicious mass) the primary purpose of the
mammogram is to screen the normal surrounding breast tissue and the opposite breast for
nonpalpable cancers, not to make a diagnosis of the palpable mass.[8] Thus, a negative
mammogram is no guarantee of the absence of malignancy, and a mass that does not disappear or
collapse with aspiration must be assumed to be a malignancy and biopsied.

DIAGNOSING BREAST CANCER: THE BIOPSY

There are 3 methods of obtaining material from a suspicious breast lump. Fine-needle aspiration
is not a reliable means of diagnosis, because it cannot distinguish ductal carcinoma in situ from
invasive cancer and it may lead to a false-negative result.[1] Fine-needle aspiration (FNA) is
generally reserved for palpable cyst-like lumps visible on a mammogram or ultrasound. False
positives are negligible, but false-negative results occur in 15% to 20% of cases, leading to the
recommendation that if the cyst or lump does not disappear with FNA, further biopsy is
mandatory.[8] Core needle biopsy has generally replaced fine-needle aspiration in all but obvious
cysts. Core needle biopsies fail to identify areas of invasion in approximately 20% of cases
originally diagnosed as ductal carcinoma in situ. Atypical ductal hyperplasia in a core needle
biopsy has a relatively high incidence of coexistent carcinoma (approximately 50%). This
diagnosis, therefore, demands excisional biopsy.

INTRADUCTAL (DUCTAL) CARCINOMA IN SITU (DCIS)

Intraductal (or ductal) carcinoma in situ (DCIS) is the proliferation of malignant epithelial cells
confined to the ducts, with no evidence of invasion through the basement membrane. Prior to
mammography, DCIS was an uncommon diagnosis. With the introduction of routine
mammography, the age-adjusted incidence of DCIS rose from 2.3 to 15.8 per 100,000 females, a
587% increase; new cases of invasive breast cancer increased 34% over the same time
period.[48] About 85% of all intraductal cancers, often less than 1 cm, are discovered by the
appearance of clustered microcalcifications on mammography. Other conditions, including
sclerosing adenosis and atypical ductal hyperplasia, may also present on mammography with
microcalcifications. The morphology of the microcalcifications is the most important factor in
differentiating benign from malignant calcification.

Findings suggesting malignancy include heterogeneous clustered calcifications, fine linear
branching calcifications, or calcifications in a segmental distribution. Magnification views of
benign findings often show multiple clusters of finely granular microcalcifications, whereas
those associated with DCIS usually appear as coarser microcalcifications.[49] For women with
poorly differentiated DCIS, the microscopic extent of disease correlates well with the
radiographic extent. In contrast, the mammographic appearance of well-differentiated DCIS can
substantially underestimate the microscopic extent. Residual microcalcifications on the
post-surgery mammogram indicate residual tumor with a positive-predictive value of 65% to
70%.[50] The likelihood of residual cancer increases to 90% if more than 5 microcalcifications
are seen on post-operative mammography.

LOBULAR CARCINOMA IN SITU (LCIS)

As it is clinically undiagnosable (it is never a palpable mass and it has no distinguishing
mammographic features), the true incidence of LCIS is unknown.[69] The incidence of LCIS in
removed breast masses has varied from 0.05% to as high as 10%,[70-72] and the incidence of
LCIS is 10-fold higher in white compared to African-American women in the United States.[73]
This diagnosis is always made incidental to a needle biopsy or a resected mass done for
fibrocystic change, fibroadenoma, or a mass suspected of being cancer.[74]

LCIS is more often detected in premenopausal than in postmenopausal women, suggesting a
hormonal influence in the development or maintenance of these lesions.[75,76] LCIS requires no
specific therapy per se.

Although the cells of LCIS are in fact small, well-differentiated neoplastic cells, they do not
behave as a true malignant neoplasm: these cells may distend and distort the terminal lobular
units, but invasion of and through the basement membrane does not occur, so the lesion never
results in invasive breast malignancy.

STAGING AND PROGNOSIS OF BREAST CANCER

At initial diagnosis, over 50% of breast cancers are stage 0 or I,[78] and 75% are stage 0, I, or II
(Table 4).[79] The extent of lymph node involvement has a profound impact on survival. Stage
IIA cancer (T0–T1, N1) with only 1 involved lymph node has a 10-year disease-free survival of
71% and a 20-year disease-free survival of 66%. If 2 to 4 lymph nodes are involved, the 10-year
disease-free survival is 62% and the 20-year disease-free survival is 56%.

SURGICAL TREATMENT OF BREAST CANCER

The Consensus Development Conference on the Treatment of Early-Stage Breast Cancer (June
1990, NCI) concluded that breast conservation treatment is an appropriate method of primary
therapy for the majority of women with stage I and stage II breast cancers. This treatment is
preferable in many cases because it provides survival equivalent to total mastectomy and axillary
dissection while preserving the breast.[80] Subsequent studies have confirmed that there is no
difference in long-term survival between surgical removal of the breast (mastectomy) and
excision of the tumor mass plus radiation therapy to the residual breast tissue (breast
conservation therapy).[81-83] Breast-conserving surgery includes lumpectomy, re-excision,
partial mastectomy, quadrantectomy, segmental excision, and wide excision.

Axillary lymph nodes are removed for evaluation through a separate incision. The most
common breast-removal procedure is a modified radical mastectomy, which involves making an
elliptical incision around an area including the nipple and biopsy scar, removing that section, and
tunneling under the remaining skin to remove the breast tissue and some lymph nodes.

Radical mastectomy, which removes the entire breast, the chest wall muscles, and all axillary
lymph nodes, is rarely done today because it offers no survival advantage over a modified radical
mastectomy. A simple, or total, mastectomy removes the entire breast but none of the axillary
lymph nodes. This is usually done for women with DCIS, or prophylactically for women at
especially high risk of developing breast cancer. A newer procedure is the skin-sparing
mastectomy, which involves removing the breast tissue through a circular incision around the
nipple and replacing the breast with fat taken from the abdomen or back.

RECOMMENDED SURVEILLANCE IN BREAST CANCER SURVIVORS

One trial randomly assigned breast cancer survivors to either a specialist or a family physician,
and found no differences between the 2 groups in measured outcomes, including time to
diagnosis of recurrence, anxiety, or health-related quality of life.[92] A subsequent economic
analysis of this study found that quality of life, as measured by the frequency and length of
patient visits, and costs were better when follow-up was provided by the family physician as
compared to the specialist.[93] Routine history and physical examination and regularly
scheduled mammograms are the mainstay of care for the breast cancer survivor.[94] Recurrence
of breast cancer is more frequently discovered by the patient (71%) than by her physician
(15%).[6] Women should be encouraged to perform breast self-examination monthly.

Mammograms should be done at 6 and 12 months after surgery and then yearly thereafter.
Several tumor-associated antigens, including CA 15-3 and CEA, may detect breast cancer
recurrence, but not with sufficient sensitivity and specificity to be routinely used by either
clinicians[95] or insurance underwriters.

A newer marker, CA 27.29, showed promise in one well-designed study of 166 women with
stage II and III breast cancer. The sensitivity and specificity of this test were 58% and 98%,
respectively. Recurrence was detected approximately 5 months earlier than with routine
surveillance.[96] However, an improvement in survival or quality of life from using this marker
has not yet been proven.

Neither routine chest x-rays nor serial radionuclide bone scans have been found to be useful
in detecting metastatic disease in asymptomatic women.

4. DESIGN AND IMPLEMENTATION OF ALGORITHM

4.1. SVM ALGORITHM


In machine learning, support-vector machines (SVMs, also support-vector networks)
are supervised learning models with associated learning algorithms that analyze data used
for classification and regression analysis. Given a set of training examples, each marked as
belonging to one or the other of two categories, an SVM training algorithm builds a model that
assigns new examples to one category or the other, making it a non-probabilistic binary linear
classifier (although methods such as Platt scaling exist to use SVM in a probabilistic
classification setting).

An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. New
examples are then mapped into that same space and predicted to belong to a category based on
the side of the gap on which they fall.
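
As a hedged, minimal sketch of this fit-then-predict workflow on the Wisconsin breast cancer
data bundled with scikit-learn; the linear kernel and the 70/30 split are illustrative choices, not
the project's final configuration (the project's own code appears in section 4.3):

# Minimal SVM classification sketch; the kernel and split are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)       # 569 samples, 30 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = SVC(kernel="linear")                       # linear-kernel SVM classifier
clf.fit(X_train, y_train)                        # learn the separating hyperplane
y_pred = clf.predict(X_test)                     # assign new examples to a category
print("Accuracy:", accuracy_score(y_test, y_pred))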

In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification using what is called the kernel trick, implicitly mapping their inputs into
high-dimensional feature spaces.

When data are unlabeled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to groups, and
then map new data to these formed groups.

The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik,
applies the statistics of support vectors, developed in the support vector machines algorithm, to
categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial
applications.

SVM is an exciting algorithm and the concepts are relatively simple.

Maximal-Margin Classifier

The Maximal-Margin Classifier is a hypothetical classifier that best explains how SVM works in
practice.

The numeric input variables (x) in your data (the columns) form an n-dimensional space. For
example, if you had two input variables, this would form a two-dimensional space.

A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to
best separate the points in the input variable space by their class, either class 0 or class 1. In two
dimensions you can visualize this as a line, and let us assume that all of our input points can be
completely separated by this line. For example:

B0 + (B1 * X1) + (B2 * X2) = 0

where the coefficients that determine the slope of the line (B1 and B2) and the intercept (B0) are
found by the learning algorithm, and X1 and X2 are the two input variables.

You can make classifications using this line. By plugging in input values into the line equation,
you can calculate whether a new point is above or below the line.

 Above the line, the equation returns a value greater than 0 and the point belongs to the
first class (class 0).
 Below the line, the equation returns a value less than 0 and the point belongs to the
second class (class 1).
 A value close to the line returns a value close to zero and the point may be difficult to
classify.
 If the magnitude of the value is large, the model may have more confidence in the
prediction.
The distance between the line and the closest data points is referred to as the margin. The best or
optimal line that can separate the two classes is the line that has the largest margin. This is called
the Maximal-Margin hyperplane.
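
A small sketch of the decision rule just described; the coefficients below are made up for
illustration (in practice they come from the learning algorithm):

# Classify a point by the sign of B0 + B1*X1 + B2*X2; coefficients are invented.
B0, B1, B2 = -1.0, 0.5, 0.5

def classify(x1, x2):
    value = B0 + B1 * x1 + B2 * x2
    # > 0: above the line (class 0); < 0: below the line (class 1);
    # values near zero are low-confidence predictions.
    return 0 if value > 0 else 1

print(classify(3.0, 3.0))  # value =  2.0 -> class 0
print(classify(0.5, 0.5))  # value = -0.5 -> class 1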

The margin is calculated as the perpendicular distance from the line to only the closest points.
Only these points are relevant in defining the line and in the construction of the classifier. These
points are called the support vectors; they support or define the hyperplane.

The hyperplane is learned from training data using an optimization procedure that maximizes
the margin.

Soft Margin Classifier

In practice, real data is messy and cannot be separated perfectly with a hyper plane.

The constraint of maximizing the margin of the line that separates the classes must be relaxed.
This is often called the soft margin classifier. This change allows some points in the training data
to violate the separating line.

An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. These coefficients are sometimes called slack variables. This increases the complexity of the model, as there are more parameters for the model to fit to the data to provide this flexibility.

A tuning parameter is introduced, called simply C, that defines the magnitude of the wiggle allowed across all dimensions. The C parameter defines the amount of violation of the margin allowed. C=0 permits no violation, and we are back to the inflexible Maximal-Margin Classifier described above. The larger the value of C, the more violations of the hyperplane are permitted.

During the learning of the hyperplane from data, all training instances that lie within the distance of the margin will affect the placement of the hyperplane and are referred to as support vectors. And as C affects the number of instances that are allowed to fall within the margin, C influences the number of support vectors used by the model.

 The smaller the value of C, the more sensitive the algorithm is to the training data (higher variance and lower bias).
 The larger the value of C, the less sensitive the algorithm is to the training data (lower variance and higher bias).

Note that the C described here acts as a budget on margin violations; it is the inverse of the penalty parameter C used in libraries such as scikit-learn (see the Regularization section below), where a larger C penalizes violations more heavily.


Support Vector Machines (Kernels)

The SVM algorithm is implemented in practice using a kernel.

The learning of the hyper plane in linear SVM is done by transforming the problem using some
linear algebra, which is out of the scope of this introduction to SVM.

A powerful insight is that the linear SVM can be rephrased using the inner product of any two
given observations, rather than the observations themselves. The inner product between two
vectors is the sum of the multiplication of each pair of input values.

For example, the inner product of the vectors [2, 3] and [5, 6] is 2*5 + 3*6 or 28.

The equation for making a prediction for a new input using the dot product between the input (x)
and each support vector (xi) is calculated as follows:

f(x) = B0 + sum(ai * (x,xi))

This is an equation that involves calculating the inner products of a new input vector (x) with all support vectors in the training data (the notation (x, xi) denotes the inner product). The coefficients B0 and ai (for each input) must be estimated from the training data by the learning algorithm.
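As a minimal sketch of this prediction rule, assuming the coefficients ai and B0 have already been learned (all the values below are placeholders):

# f(x) = B0 + sum(ai * (x, xi)), where (x, xi) is the inner product.
support_vectors = [[2.0, 3.0], [5.0, 6.0]]  # placeholder support vectors
a = [0.5, -0.25]                            # placeholder coefficients ai
B0 = 0.1                                    # placeholder intercept

def inner(u, v):
    # Inner product: the sum of the pairwise products of the inputs.
    return sum(ui * vi for ui, vi in zip(u, v))

def predict(x):
    value = B0 + sum(ai * inner(x, xi) for ai, xi in zip(a, support_vectors))
    return 0 if value > 0 else 1

print(predict([1.0, 2.0]))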

Linear Kernel SVM

The dot-product is called the kernel and can be re-written as:

K(x, xi) = sum(x * xi)

The kernel defines the similarity or a distance measure between new data and the support
vectors. The dot product is the similarity measure used for linear SVM or a linear kernel because
the distance is a linear combination of the inputs.

Other kernels can be used that transform the input space into higher dimensions such as a
Polynomial Kernel and a Radial Kernel. This is called the Kernel Trick.


It is desirable to use more complex kernels, as this allows the boundaries that separate the classes to be curved or even more complex. This in turn can lead to more accurate classifiers.

Polynomial Kernel SVM

Instead of the dot-product, we can use a polynomial kernel, for example:

K(x, xi) = (1 + sum(x * xi))^d

Where the degree d of the polynomial must be specified by hand to the learning algorithm. When d=1 this is the same as the linear kernel. The polynomial kernel allows for curved lines in the input space.

Radial Kernel SVM

Finally, we can also have a more complex radial kernel. For example:

K(x, xi) = exp(-gamma * sum((x - xi)^2))

Where gamma is a parameter that must be specified to the learning algorithm. A good default value for gamma is 0.1, and gamma is often in the range 0 < gamma < 1. The radial kernel is very local and can create complex regions within the feature space, like closed polygons in two-dimensional space.
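As a small sketch, the three kernels above can be written directly in Python (reading the polynomial formula as (1 + sum(x * xi))^d):

import math

def linear_kernel(x, xi):
    # K(x, xi) = sum(x * xi): the plain dot product.
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, d=2):
    # K(x, xi) = (1 + sum(x * xi))^d: allows curved boundaries in the input space.
    return (1 + linear_kernel(x, xi)) ** d

def radial_kernel(x, xi, gamma=0.1):
    # K(x, xi) = exp(-gamma * sum((x - xi)^2)): very local influence.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))

print(linear_kernel([2, 3], [5, 6]))      # 28, as in the example above
print(polynomial_kernel([2, 3], [5, 6]))  # (1 + 28)^2 = 841
print(radial_kernel([2, 3], [5, 6]))      # exp(-0.1 * 18)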

How to Learn a SVM Model

The SVM model needs to be solved using an optimization procedure.

You can use a numerical optimization procedure to search for the coefficients of the hyper plane.
This is inefficient and is not the approach used in widely used SVM implementations
like LIBSVM. If implementing the algorithm as an exercise, you could use stochastic gradient
descent.
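For instance, a minimal sketch of that exercise using scikit-learn's SGDClassifier, whose hinge loss trains a linear SVM by stochastic gradient descent (the toy data below stands in for the breast cancer features):

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Toy two-class data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# loss='hinge' gives a linear SVM fitted by stochastic gradient descent.
clf = SGDClassifier(loss="hinge", alpha=0.001, max_iter=1000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))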
There are specialized optimization procedures that re-formulate the optimization problem to be a Quadratic Programming problem. The most popular method for fitting SVM is the Sequential Minimal Optimization (SMO) method, which is very efficient. It breaks the problem down into sub-problems that can be solved analytically (by calculating) rather than numerically (by searching or optimizing).

Data Preparation for SVM

This section lists some suggestions for how to best prepare your training data when learning an
SVM model.

 Numerical Inputs: SVM assumes that your inputs are numeric. If you have categorical inputs you may need to convert them to binary dummy variables (one variable for each category; see the sketch after this list).
 Binary Classification: Basic SVM as described in this post is intended for binary (two-class) classification problems, although extensions have been developed for regression and multi-class classification.
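A minimal sketch of that conversion with pandas (the column name 'tumor_type' and its values are made up for illustration):

import pandas as pd

# A made-up categorical column for illustration.
df = pd.DataFrame({'tumor_type': ['solid', 'cystic', 'solid', 'mixed']})

# One binary dummy variable per category.
dummies = pd.get_dummies(df['tumor_type'], prefix='tumor_type')
print(dummies.head())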

Support Vector Machines (SVMs) are large margin classifiers. The margin can be understood as the distance of an example to the separation boundary; a large margin classifier generates a decision boundary with a large margin to almost all training examples. SVMs were introduced by Boser et al. (1992) and can be used for binary classification problems. SVMs are not machines in the classical sense and do not consist of any tangible parts; they are merely mathematical algorithms for pattern matching. To understand the mathematical model, a few definitions are needed.
Notation 1 (Scalar Product). Let Vn be an n-dimensional vector space; for x, y ∈ Vn the scalar product will be denoted by ⟨x, y⟩ ∶= ∑i xi ⋅ yi.
Notation 2 (Euclidean Norm). Let x ∈ Rn; the Euclidean norm will be written as ∣∣x∣∣ ∶= √⟨x, x⟩.
A hyperplane is an (n − 1)-dimensional object, in the same sense that in a three-dimensional space a two-dimensional object can be seen as a plane. The formal definition follows:


Definition (Hyperplane). A hyperplane h ⊂ Rn can be explicitly defined by its perpendicular vector w ∈ Rn and its distance b ∈ R from the origin:
h ∶= {x ∈ Rn ∶ ⟨x, w⟩ = b}.
If b = 0 the hyperplane is said to be unbiased. The distance of the hyperplane from the origin is b/∣∣w∣∣ units.

Definition (Linearly Separable). A set of instance-label pairs
S ∶= {(xi, yi) ∶ xi ∈ Rn, yi ∈ {−1, 1}, i = 1, . . . , l} ⊂ Rn × {−1, 1}
is said to be linearly separable if there exist w ∈ Rn and b ∈ R for a hyperplane h = {x ∈ Rn ∶ ⟨x, w⟩ = b}, such that ∀(xi, yi) ∈ S ∶ yi(⟨xi, w⟩ − b) > 0.
The binary classification problem can therefore be written as the following:

Definition (Binary Classification Problem). Assume a set of instance-label pairs
S = {(xi, yi) ∶ xi ∈ Rn, yi ∈ {−1, 1}, i = 1, . . . , l}.
The objective is to find a prediction function f ∶ Rn ↦ {−1, 1} which satisfies ∀(xi, yi) ∈ S ∶ f(xi) = yi.

Given a linearly separable binary classification problem, a hyperplane h = {x ∈ Rn ∶ ⟨x, w⟩ = b} exists that clearly separates the set {(xi, yi) ∈ S ∶ yi = −1} from {(xi, yi) ∈ S ∶ yi = 1}. The prediction function f ∶ Rn ↦ {−1, 1} can then be derived from the hyperplane:
f(x) = sign(⟨x, w⟩ − b).    (2.9)
The hyperplane is not unique, as figure 2.4a shows, but only one hyperplane has the broadest margin. The margin describes the shortest distance from any point in S to h, as depicted in figure 2.4b. This leads to the formulation of the Support Vector Machine:

Definition (Support Vector Machine). The support vector machine finds the hyperplane h with the broadest margin for a linearly separable binary classification problem as the solution to
min_{w,b} (1/2)∣∣w∣∣²  subject to  ∀(xi, yi) ∈ S ∶ yi(⟨xi, w⟩ − b) ≥ 1.
This optimization problem is known as the primal form of the SVM. It can also be solved in its dual form.

Theorem (Dual Form of the Support Vector Machine). The support vector machine can also be solved in its dual form, and the optimal solutions of the primal and dual form coincide.
Proof. Using the generalized Lagrangian, the primal form can be written as:
min_{w,b} max_{α} L(w, b, α) = (1/2)∣∣w∣∣² − ∑i αi (yi(⟨xi, w⟩ − b) − 1)    (2.12)

for which the dual form is defined as:
max_{α} min_{w,b} (1/2)∣∣w∣∣² − ∑i αi (yi(⟨xi, w⟩ − b) − 1)  subject to  0 ≤ α.

Computing the partial derivatives with respect to the primal variables and equating them to 0 gives:
∂L/∂w (w, b, α) = w − ∑i αi yi xi = 0  ⇒  w = ∑i αi yi xi    (2.13)
∂L/∂b (w, b, α) = ⟨α, y⟩ = 0    (2.14)

Substituting equations 2.13 and 2.14 into 2.12 leads to the dual form:
max_{α} ∑i αi − (1/2) ∑i ∑j αi αj yi yj ⟨xi, xj⟩  subject to  0 ≤ α ∧ ⟨α, y⟩ = 0.

Because the optimization problem is convex and Slater's condition is satisfied, strong duality holds:
max_{0≤α} min_{w,b} L(w, b, α) = min_{w,b} max_{0≤α} L(w, b, α).

This implies that the optimal solutions of the primal and dual form coincide. From the optimal solution α∗ of the dual form, the primal variables w∗ and b∗ can be computed. From the Karush-Kuhn-Tucker (KKT) conditions, the complementary slackness
∀i ∶ α∗i (1 − yi(⟨xi, w∗⟩ − b∗)) = 0
is used to compute b∗. It implies that for every i with α∗i > 0 the point xi lies on the margin: yi(⟨xi, w∗⟩ − b∗) = 1. This can also be seen in figure 2.5b.


w∗ = ∑i α∗i yi xi
b∗ = (1/2) ( max_{i ∶ yi=−1 ∧ α∗i>0} ⟨w∗, xi⟩ + min_{i ∶ yi=1 ∧ α∗i>0} ⟨w∗, xi⟩ )    (2.19)

In order to remove the restriction to linearly separable classification problems, a penalty parameter C ∈ R+ as well as slack variables ζi ∈ R+ for each point in S are introduced; ζ will denote the vector (ζ1, ζ2, . . .)ᵗ. (Note on notation: if a variable x ∈ R is used where a vector is expected, x implicitly represents the vector (x, x, . . . , x)ᵗ of the required dimension.) The slack variables are used to soften the constraints on the optimization problem, while the penalty parameter allows one to adjust the impact of the slack variables on the objective function.


What is a classification analysis?

Let’s consider an example to understand these concepts. We have a population composed of 50% males and 50% females. Using a sample of this population, we want to create a set of rules which will tell us the gender class for the rest of the population. Using this algorithm, we intend to build a robot which can identify whether a person is a male or a female. This is a sample problem of classification analysis. Using some set of rules, we will try to classify the population into two possible segments. For simplicity, let’s assume that the two differentiating factors identified are the height of the individual and the hair length. Following is a scatter plot of the sample.

The blue circles in the plot represent females and the green squares represent males. A few expected insights from the graph are:

1. Males in our population have a higher average height.

2. Females in our population have longer scalp hair.


If we were to see an individual with height 180 cms and hair length 4 cms, our best guess will be
to classify this individual as a male. This is how we do a classification analysis.

What is a Support Vector and what is SVM?

Support vectors are simply the coordinates of individual observations. For instance, (45, 150) is a support vector which corresponds to a female. A Support Vector Machine is a frontier which best segregates the males from the females. In this case, the two classes are well separated from each other, hence it is easier to find an SVM.

How to find the Support Vector Machine for the case in hand?

There are many possible frontiers which can classify the problem in hand. Following are three possible frontiers.

How do we decide which is the best frontier for this particular problem statement?


The easiest way to interpret the objective function in an SVM is to find the minimum distance of the frontier from the closest support vector (which can belong to either class). For instance, the orange frontier is closest to the blue circles, and the closest blue circle is 2 units away from the frontier. Once we have these distances for all the frontiers, we simply choose the frontier with the maximum distance from the closest support vector. Of the three frontiers shown, the black frontier is farthest from its nearest support vector (i.e. 15 units).

What if we do not find a clean frontier which segregates the classes?

Our job was relatively easy finding the SVM in this business case. What if the distribution looked like the following?

In such cases, we do not see a straight-line frontier directly in the current plane which can serve as the SVM. In such cases, we need to map these vectors to a higher-dimensional plane so that they get


segregated from each other. Such cases will be covered once we start with the formulation of SVM. For now, you can visualize that such a transformation will result in the following type of SVM.

Each of the green squares in the original distribution is mapped onto a transformed scale, and the transformed scale has clearly segregated the classes.

Support Vector Machines are a very powerful classification algorithm. When used in conjunction with random forests and other machine learning tools, they give a very different dimension to ensemble models. Hence, they become very crucial for cases where very high predictive power is required. Such algorithms are slightly harder to visualize because of the complexity of their formulation. You will find these algorithms very useful for solving some Kaggle problem statements.


An example of support vector machine:

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

Suppose you are given a plot of two labeled classes on a graph as shown in image (A). Can you decide on a separating line for the classes?

Image A: Draw a line that separates black circles and blue squares.

You might have come up with something similar to the following image (image B). It fairly separates the two classes. Any point to the left of the line falls into the black circle class and any point to the right falls into the blue square class: separation of classes. That’s what SVM does. It finds a line / hyperplane (in a multidimensional space) that separates the classes. Shortly, we shall discuss why I wrote multidimensional space.

Image B: Sample cut to divide into two classes.

1. Making it a bit complex…

So far so good. Now consider what happens if we have data as shown in the image below. Clearly, there is no line that can separate the two classes in this x-y plane. So what do we do? We apply a transformation and add one more dimension: the z-axis. Let’s assume the value of a point on the z-axis is z = x² + y². In this case we can interpret z as the squared distance of the point from the origin. Now if we plot along the z-axis, a clear separation is visible and a line can be drawn.
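A minimal sketch of this lift, with a few made-up points (an inner cluster and an outer ring that no straight line in the x-y plane can separate):

import numpy as np

# Made-up points: class 0 near the origin, class 1 further out.
points = np.array([[0.5, 0.2], [-0.3, 0.4], [2.0, 1.5], [-1.8, -2.1]])
labels = np.array([0, 0, 1, 1])

# Add the extra dimension z = x^2 + y^2 (squared distance from the origin).
z = (points ** 2).sum(axis=1)

# In the lifted space a simple threshold on z (a plane) separates the classes.
print(z)         # small values for class 0, large values for class 1
print(z > 1.0)   # threshold chosen by eye for this toy data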

Can you draw a separating line in this plane?


Plot of the z-y axis. A separation can be made here.

When we transform this line back to the original plane, it maps to a circular boundary as shown in image E. These transformations are called kernels.

Transforming back to the x-y plane, the line becomes a circle.

Thankfully, you don’t have to guess or derive the transformation every time for your data set. The sklearn library's SVM implementation provides this built in.

2. Making it a little more complex…

What if the data plots overlap? Or, what if some of the black points are inside the blue ones? Which line, 1 or 2, should we draw?


What in this case?

Image 1


Image 2

Which one do you think? Well, both answers are correct. The first one tolerates some outlier points. The second one is trying to achieve zero tolerance with a perfect partition.

But there is a trade-off. In a real-world application, finding a perfect partition for millions of training examples takes a lot of time, as you will see in the coding. This trade-off is controlled by the regularization parameter. In the next section, we define two terms, the regularization parameter and gamma, which are tuning parameters in the SVM classifier. By varying them we can achieve a considerably non-linear classification line with more accuracy in a reasonable amount of time. In the coding exercise (part 2 of this chapter) we shall see how we can increase the accuracy of SVM by tuning these parameters.

One more parameter is the kernel. It defines whether we want a linear or non-linear separation. This is also discussed in the next section.



3. Tuning parameters: Kernel, Regularization, Gamma and Margin.

Kernel

The learning of the hyperplane in linear SVM is done by transforming the problem using some linear algebra. This is where the kernel plays a role.

For the linear kernel, the equation for the prediction of a new input using the dot product between the input (x) and each support vector (xi) is calculated as follows:

f(x) = B0 + sum(ai * (x, xi))

This is an equation that involves calculating the inner products of a new input vector (x) with all
support vectors in training data. The coefficients B0 and ai (for each input) must be estimated
from the training data by the learning algorithm.

The polynomial kernel can be written as K(x, xi) = (1 + sum(x * xi))^d, and the exponential (radial) kernel as K(x, xi) = exp(-gamma * sum((x - xi)^2)).

Polynomial and exponential kernels calculate the separation line in a higher dimension. This is called the kernel trick.

Regularization

The regularization parameter (often termed the C parameter in Python’s sklearn library) tells the SVM optimization how much you want to avoid misclassifying each training example.

For large values of C, the optimization will choose a smaller-margin hyperplane if that
hyperplane does a better job of getting all the training points classified correctly. Conversely, a
very small value of C will cause the optimizer to look for a larger-margin separating hyperplane,
even if that hyperplane misclassifies more points.


The images below (the same as image 1 and image 2 in section 2) are examples of two different regularization parameter values. The left one has some misclassification due to a lower regularization value; a higher value leads to results like the one on the right.

Left: low regularization value; right: high regularization value

Gamma

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. In other words, with low gamma, points far away from the plausible separation line are considered in the calculation of the separation line, whereas with high gamma only the points close to the plausible line are considered.


High Gamma

Low Gamma

Margin

And finally, the last but very important characteristic of the SVM classifier: at its core, SVM tries to achieve a good margin.

A margin is the separation of the line to the closest class points.

A good margin is one where this separation is large for both classes. The images below give a visual example of good and bad margins. A good margin allows the points to be in their respective classes without crossing over to the other class.
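Pulling the tuning parameters above together (kernel, C and gamma), a minimal sketch of searching over them with scikit-learn's GridSearchCV (the grid values are illustrative, not recommendations):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative grid over the tuning parameters discussed above.
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],              # regularization: margin width vs. misclassification
    'gamma': ['scale', 0.01, 0.1],  # reach of a single training example (rbf only)
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))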


4.2. BREAST CANCER DETECTION THROUGH SVM

Background:

Breast cancer is the most common cancer amongst women in the world. It accounts for 25% of
all cancer cases, and affected over 2.1 Million people in 2015 alone. It starts when cells in the
breast begin to grow out of control. These cells usually form tumors that can be seen via X-ray or
felt as lumps in the breast area.

Early diagnosis significantly increases the chances of survival. The key challenge in detection is how to classify tumors into malignant (cancerous) or benign (non-cancerous). A tumor is considered malignant if the cells can grow into surrounding tissues or spread to distant areas of the body. A benign tumor does not invade nearby tissue or spread to other parts of the body the way cancerous tumors can, but benign tumors can be serious if they press on vital structures such as blood vessels or nerves.

Machine learning techniques can dramatically improve the level of diagnosis of breast cancer. Research shows that experienced physicians can detect cancer with 79% accuracy, while 91% (sometimes up to 97%) accuracy can be achieved using machine learning techniques.

Project Task


In this study, my task is to classify tumors into malignant (cancerous) or benign (non-cancerous)
using features obtained from several cell images.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
They describe characteristics of the cell nuclei present in the image.

Attribute Information:

1. ID number
2. Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

1. Radius (mean of distances from center to points on the perimeter)


2. Texture (standard deviation of gray-scale values)
3. Perimeter
4. Area
5. Smoothness (local variation in radius lengths)
6. Compactness (perimeter² / area - 1.0)
7. Concavity (severity of concave portions of the contour)
8. Concave points (number of concave portions of the contour)
9. Symmetry
10. Fractal dimension (“coastline approximation” - 1)


4.3.CODE

#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Load the data
from google.colab import files # Use to load data on Google Colab
uploaded = files.upload() # Use to load data on Google Colab
df = pd.read_csv('breast cancer dataset 2.csv')
df.head()

#Count the number of rows and columns in the data set
df.shape

#Count the empty (NaN, NAN, na) values in each column
df.isna().sum()

#Drop the column with all missing values (na, NAN, NaN)
#NOTE: This drops the column Unnamed
df = df.dropna(axis=1)

#Get the new count of the number of rows and cols
df.shape

#Get a count of the number of Malignant (M) (harmful) or Benign (B) cells (not harmful)
df['diagnosis'].value_counts()

#Visualize this count
sns.countplot(df['diagnosis'], label="Count")

#Look at the data types to see which columns need to be transformed / encoded to a number
df.dtypes
#Transform / Encode the column diagnosis
#dictionary = {'M':1, 'B':0} #Create a dictionary
#df.diagnosis = [dictionary[item] for item in df.diagnosis] #Change all 'M' to 1 and all 'B' to 0 in the diagnosis col

#Encoding categorical data values (Transforming categorical data / Strings to integers)
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
df.iloc[:,1] = labelencoder_Y.fit_transform(df.iloc[:,1].values)
print(labelencoder_Y.fit_transform(df.iloc[:,1].values))

#A "pairs plot" is a grid of scatterplots, in which each variable in a data row is plotted against the other variables
sns.pairplot(df, hue="diagnosis")
#sns.pairplot(df.iloc[:,1:6], hue="diagnosis") #Plot a sample of the columns
#Get the correlation of the columns
df.corr()

#Visualize the correlation
#NOTE: To see the numbers within the cells ==> sns.heatmap(df.corr(), annot=True)
plt.figure(figsize=(20,20)) #This is used to change the size of the figure / heatmap
sns.heatmap(df.corr(), annot=True, fmt='.0%')
#plt.figure(figsize=(10,10))
#sns.heatmap(df.iloc[:,1:12].corr(), annot=True, fmt='.0%') #Heat map of 11 columns (index 1-11); index 0 is just the id column and is left out

#Split the data into independent 'X' and dependent 'Y' variables
X = df.iloc[:, 2:31].values #Start from index 2 to 31, essentially removing the id column & diagnosis
Y = df.iloc[:, 1].values #Get the target variable 'diagnosis' located at index 1

# Split the dataset into 75% Training set and 25% Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)

# Scale the data to bring all features to the same level of magnitude
# This means the data will be within a specific range for example 0 -100 or 0 - 1

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#Create a function within many Machine Learning Models
def models(X_train,Y_train):
  
  #Using Logistic Regression Algorithm to the Training Set
  from sklearn.linear_model import LogisticRegression
  log = LogisticRegression(random_state = 0)
  log.fit(X_train, Y_train)
  
  #Using KNeighborsClassifier Method of neighbors class to use Nearest Neighbor algorithm
  from sklearn.neighbors import KNeighborsClassifier
  knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)


  knn.fit(X_train, Y_train)

  #Using SVC method of svm class to use Support Vector Machine Algorithm
  from sklearn.svm import SVC
  svc_lin = SVC(kernel = 'linear', random_state = 0)
  svc_lin.fit(X_train, Y_train)

  #Using SVC method of svm class to use Kernel SVM Algorithm
  from sklearn.svm import SVC
  svc_rbf = SVC(kernel = 'rbf', random_state = 0)
  svc_rbf.fit(X_train, Y_train)

  #Using GaussianNB method of naïve_bayes class to use Naïve Bayes Algorithm
  from sklearn.naive_bayes import GaussianNB
  gauss = GaussianNB()
  gauss.fit(X_train, Y_train)

  #Using DecisionTreeClassifier of tree class to use Decision Tree Algorithm
  from sklearn.tree import DecisionTreeClassifier
  tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
  tree.fit(X_train, Y_train)

  #Using RandomForestClassifier method of ensemble class to use Random Forest Classification algorithm
  from sklearn.ensemble import RandomForestClassifier
  forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
  forest.fit(X_train, Y_train)
  
  #print model accuracy on the training data.
  print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
  print('[1]K Nearest Neighbor Training Accuracy:', knn.score(X_train, Y_train))


  print('[2]Support Vector Machine (Linear Classifier) Training Accuracy:', svc_lin.score(X_train, Y_train))
  print('[3]Support Vector Machine (RBF Classifier) Training Accuracy:', svc_rbf.score(X_train, Y_train))
  print('[4]Gaussian Naive Bayes Training Accuracy:', gauss.score(X_train, Y_train))
  print('[5]Decision Tree Classifier Training Accuracy:', tree.score(X_train, Y_train))
  print('[6]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))
  
  return log, knn, svc_lin, svc_rbf, gauss, tree, forest

#Train all of the models on the training data and store them in a tuple
model = models(X_train, Y_train)

#Show the confusion matrix and accuracy for all of the models on the test data
#Classification accuracy is the ratio of correct predictions to total predictions made.
from sklearn.metrics import confusion_matrix
for i in range(len(model)):
  cm = confusion_matrix(Y_test, model[i].predict(X_test))
  
  TN = cm[0][0]
  TP = cm[1][1]
  FN = cm[1][0]
  FP = cm[0][1]
  
  print(cm)
  print('Model[{}] Testing Accuracy = "{}!"'.format(i,  (TP + TN) / (TP + TN + FN + FP)))
  print()# Print a new line
#Show other ways to get the classification accuracy & other metrics 

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

for i in range(len(model)):
  print('Model ',i)
  #Check precision, recall, f1-score


  print( classification_report(Y_test, model[i].predict(X_test)) )
  #Another way to get the models accuracy on the test data
  print( accuracy_score(Y_test, model[i].predict(X_test)))
  print()#Print a new line
#Print the prediction of the Random Forest Classifier model
pred = model[6].predict(X_test)
print(pred)

#Print a space
print()

#Print the actual values
print(Y_test)

4.4.SCREENSHOTS

Loading Python Libraries and Breast Cancer Dataset


Let’s view the data in a dataframe

Features (Columns) breakdown


Visualize the relationship between our features


Let’s check the correlation between our features

There is a strong correlation between mean radius and mean perimeter, as well as mean
area and mean perimeter


What do we mean when we say “Modeling” ?

Depending on how long we’ve lived in a particular place and traveled to a location, we probably
have a good understanding of commute times in our area. For example, we’ve traveled to
work/school using some combination of the metro, buses, trains, ubers, taxis, carpools, walking,
biking, etc.

All humans naturally model the world around them.

Over time, our observations about transportation have built up a mental dataset and a mental
model that helps us predict what traffic will be like at various times and locations. We probably
use this mental model to help plan our days, predict arrival times, and many other tasks.

 As data scientists we attempt to make our understanding of relationships between different quantities more precise through using data and mathematical/statistical structures.
 This process is called modeling.
 Models are simplifications of reality that help us to better understand that which we observe.
 In a data science setting, models generally consist of a dependent variable (or output) of interest and one or more independent variables (or inputs) believed to influence the dependent variable.

Model-based inference

 We can use models to conduct inference.
 Given a model, we can better understand relationships between an independent variable and the dependent variable or between multiple independent variables.

An example of where inference from a mental model would be valuable is:

Determining what times of the day we work best or get tired.


Prediction

 We can use a model to make predictions, or to estimate a dependent variable’s value given at least one independent variable’s value.
 Predictions can be valuable even if they are not exactly right.
 Good predictions are extremely valuable for a wide variety of purposes.

An example of where prediction from a mental model could be valuable:

Predicting how long it will take to get from point A to point B.

What is the difference between model prediction and inference?

 Inference is judging what relationship, if any, there is between the data and the output.
 Prediction is making guesses about future scenarios based on data and a model
constructed on that data.

In this project, we will be talking about a Machine Learning Model called Support Vector
Machine (SVM)

Introduction to Classification Modeling: Support Vector Machine (SVM)

What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a binary linear classifier whose decision boundary is explicitly constructed to minimize generalization error. It is a very powerful and versatile machine learning model, capable of performing linear or nonlinear classification, regression and even outlier detection.

SVM is well suited for classification of complex but small or medium sized datasets.


How does SVM classify?

It’s important to start with the intuition for SVM with the special linearly separable
classification case.

If classification of observations is “linearly separable”, SVM fits the “decision boundary” that
is defined by the largest margin between the closest points for each class. This is commonly
called the “maximum margin hyperplane (MMH)”.

The advantages of support vector machines are:

 Effective in high dimensional spaces.
 Still effective in cases where the number of dimensions is greater than the number of samples.
 Uses a subset of training points in the decision function (called support vectors), so it is
also memory efficient.
 Versatile: different Kernel functions can be specified for the decision function. Common
kernels are provided, but it is also possible to specify custom kernels.


The disadvantages of support vector machines include:

 If the number of features is much greater than the number of samples, avoiding over-fitting in choosing kernel functions and the regularization term is crucial.
 SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities in the scikit-learn documentation).

Now that we have better understanding of Modeling and Support Vector Machine (SVM), let’s
start training our predictive model.

Model Training

From our dataset, let’s create the target and predictor matrix

 “y” = the feature we are trying to predict (output). In this case we are trying to predict whether our “target” is cancerous (malignant) or not (benign), i.e. we are going to use the “target” feature here.
 “X” = the predictors, which are the remaining columns (mean radius, mean texture, mean perimeter, mean area, mean smoothness, etc.); see the sketch after this list.
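The screenshot for this step is omitted here; a minimal sketch, assuming the scikit-learn copy of the dataset (sklearn.datasets.load_breast_cancer), whose column names match the ones listed above:

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset as a dataframe; in this copy, target 0 = malignant, 1 = benign.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

y = df['target']               # what we are trying to predict
X = df.drop('target', axis=1)  # the predictor columns (mean radius, ...)
print(X.shape, y.shape)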


Create the training and testing data

Now that we’ve assigned values to our “X” and “y”, the next step is to import the python library
that will help us split our dataset into training and testing data.

 Training data = the subset of our data used to train our model.
 Testing data = the subset of our data that the model hasn’t seen before (We will be using
this dataset to test the performance of our model).

Let’s split our data using 80% for training and the remaining 20% for testing.
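Continuing the sketch above, the 80/20 split with scikit-learn:

from sklearn.model_selection import train_test_split

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
print(X_train.shape, X_test.shape)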


Import Support Vector Machine (SVM) Model

Now, let’s train our SVM model with our “training” dataset.

Let’s use our trained model to make a prediction using our testing data


Next step is to check the accuracy of our prediction by comparing it to the output we
already have (y_test). We are going to use confusion matrix for this comparison.

Let’s create a confusion matrix for our classifier’s performance on the test dataset.

Let’s visualize our confusion matrix on a Heatmap
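The corresponding screenshots are omitted here; a minimal sketch of these steps (train, predict, evaluate), continuing the same assumed setup:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

# Train an SVM classifier on the training data.
svc_model = SVC()
svc_model.fit(X_train, y_train)

# Predict on the held-out test data.
y_predict = svc_model.predict(X_test)

# Compare predictions to the known answers and visualize on a heatmap.
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)
plt.show()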


As we can see, our model did not do a good job in its predictions. It predicted that 48 healthy patients have cancer. We achieved only 34% accuracy!

Let’s explore ways to improve the performance of our model.

Improving our Model

The first process we will try is by normalizing our data

Data normalization is a feature scaling process that brings all values into range [0,1]

X' = (X - X_min) / (X_max - X_min)
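A minimal sketch of this scaling step, using scikit-learn's MinMaxScaler on the assumed split from above (fit on the training data only, then apply to both sets):

from sklearn.preprocessing import MinMaxScaler

# Learn X_min and X_max from the training data only, then scale both sets
# into the [0, 1] range with X' = (X - X_min) / (X_max - X_min).
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)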

Normalize Training Data




Now, let’s train our SVM model with our scaled (Normalized) datasets.

Prediction with Scaled dataset

Confusion Matrix on Scaled dataset


Our prediction got a lot better, with only 1 false prediction (cancer predicted instead of healthy). We achieved 98% accuracy!


5. SYSTEM TESTING


The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test, and each test type addresses a specific testing requirement.

5.1.TYPES OF TESTS

Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit and before integration. This is a structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.

Integration testing
Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.


Functional test

Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identifying business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of current tests is determined.

5.2. System Test

System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration-oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links and integration points.

White Box Testing

White Box Testing is a testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.


Black Box Testing

Black Box Testing is testing the software without any knowledge of the inner workings, structure or language of the module being tested. Black box tests, as most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is a testing in which the software under test is treated as a black box: you cannot “see” into it. The test provides inputs and responds to outputs without considering how the software works.

Unit Testing:

Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.

Test strategy and approach


Field testing will be performed manually and functional tests will be written in detail.

Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested
 Verify that the entries are of the correct format
 No duplicate entries should be allowed
 All links should take the user to the correct page.


Integration Testing

Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects. The task of the integration test is to check that components or software applications, e.g. components in a software system or, one step up, software applications at the company level, interact without error.

Test Results:
All the test cases mentioned above passed successfully. No defects encountered.

Acceptance Testing

User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.

Test Results:
All the test cases mentioned above passed successfully. No defects encountered.


6. RESULT ANALYSIS


6.1. RESULT ANALYSIS:


Our dataset contains 16 attributes; dimensionality reduction contributes a lot in decreasing the multi-dimensional data to a few dimensions. Of the three applied algorithms, Support Vector Machine, k-Nearest Neighbor and Logistic Regression, SVM gives the highest accuracy of 92.7% when compared to the other two algorithms. So, we propose that SVM is the best suited algorithm for the prediction of breast cancer occurrence with complex datasets.


Table 1 shows the comparison between the algorithms in terms of Accuracy, Precision, Sensitivity, Specificity and False Positive Rate.


7. CONCLUSION AND FUTURE SCOPE


7.1. CONCLUSION:
Our work mainly focused on the advancement of predictive models to achieve good accuracy in predicting valid disease outcomes using supervised machine learning methods. The analysis of the results signifies that the integration of multidimensional data along with different classification, feature selection and dimensionality reduction techniques can provide auspicious tools for inference in this domain. Further research in this field should be carried out for the better performance of the classification techniques so that they can predict on more variables.

FUTURE SCOPE:
As not only the field of interest but also the results of this study turned out to be rich and broad, there are several ways to extend this work in many directions. Some possible ways to investigate it in the near future are discussed below. In the Jacobi moments calculation, a 4x4 overlapping window is used for extracting features. As the size of the window increases, the computation cost also increases. The computational cost can be reduced by using a non-overlapping window of varying size. However, if the normal and abnormal regions are occupied in a single window of small size, the performance rate will be reduced, whereas the performance rate may be increased by using a non-overlapping window of varying size. In future, by carefully choosing the window size, the number of features may be reduced to achieve the same or a better classification rate with reduced computational cost. In the second proposed system, a classification system based on a combined feature set, the Haar discrete wavelet transform is used. Recently, a number of multi-resolution analyses have been introduced by many researchers, such as the Contourlet transform, Curvelet transform, Ridgelet transform and so on. To improve the classification and performance of the mammogram classification system, the features of normal and abnormal regions must be clear. Instead of using the wavelet transform to extract the features in this method, the other multi-resolution analyses may be used in future. Furthermore, a clinically useful system for breast cancer detection must be able to detect the breast cancers themselves, not just micro-calcifications. In future, the combined feature set method can be used for the detection and classification of masses.


8. REFERENCES


REFERENCES:

[1] E. C. Fear, P. M. Meaney, and M. A. Stuchly, "Microwaves for breast cancer detection", IEEE Potentials, vol. 22, pp. 12-18, February-March 2003.

[2] Homer MJ, Mammographic Interpretation: A Practical Approach, McGraw-Hill, Boston, MA, second edition, 1997.

[3] American College of Radiology, Reston, VA, Illustrated Breast Imaging Reporting and Data System (BI-RADS), third edition, 1998.

[4] S. M. Astley, "Computer-based detection and prompting of mammographic abnormalities", Br. J. Radiol., vol. 77, pp. S194-S200, 2004.

[5] L. J. W. Burhenne, "Potential contribution of computer aided detection to the sensitivity of screening mammography", Radiology, vol. 215, pp. 554-562, 2000.

[6] T. W. Freer and M. J. Ulissey, "Screening mammography with computer aided detection: prospective study of 2860 patients in a community breast center", Radiology, vol. 220, pp. 781-786, 2001.

[7] C. Cortes and V. N. Vapnik, "Support vector networks", Machine Learning, Boston, vol. 20, pp. 273-297, September 1995.

[8] V. N. Vapnik, "An overview of statistical learning theory", IEEE Trans. Neural Networks, New York, vol. 10, pp. 988-999, September 1999.

[9] O. Chapelle, V. N. Vapnik, and Y. Bengio, "Model selection for small sample regression", Machine Learning, Boston, vol. 48, pp. 9-23, July 2002.

[10] Y. Liu and Y. F. Zheng, "FS_SFS: A novel feature selection method for support vector machines", Pattern Recognition, New York, vol. 39, pp. 1333-1345, December 2006.

[11] N. Acir, "A support vector machine classifier algorithm based on a perturbation method and its application to ECG beat recognition systems", Expert Systems with Applications, New York, vol. 31, pp. 150-158, July 2006.

[12] Girosi F., Jones M., and Poggio T., "Regularization theory and neural network architectures", Neural Computation, Cambridge, vol. 7, pp. 217-269, July 1995.

[13] Smola A. J., Scholkopf B., and Muller K. R., "The connection between regularization operators and support vector kernels", Neural Networks, New York, vol. 11, pp. 637-649, November 1998.

[14] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using MATLAB, Prentice Hall, New Jersey, USA, 2004.

[15] C. Scott, "TEMPLAR software package, copyright 2001, Rice University", 2001.

[16] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
