
Research and applications

Breast cancer survivability prediction using labeled, unlabeled, and pseudo-labeled patient data
Juhyeon Kim, Hyunjung Shin

Department of Industrial Engineering, Ajou University, Suwon, South Korea

Correspondence to Professor Hyunjung (Helen) Shin, Department of Industrial Engineering, Ajou University, San5 Wonchun-dong, Yeongtong-gu, Suwon 443-749, South Korea; shin@ajou.ac.kr

Received 16 December 2012; Revised 28 January 2013; Accepted 2 February 2013; Published Online First 6 March 2013

To cite: Kim J, Shin H. J Am Med Inform Assoc 2013;20:613–618.

Downloaded from https://academic.oup.com/jamia/article-abstract/20/4/613/819025 by guest on 17 January 2020

ABSTRACT
Background Prognostic studies of breast cancer survivability have been aided by machine learning algorithms, which can predict the survival of a particular patient based on historical patient data. However, it is not easy to collect labeled patient records: it takes at least 5 years to label a patient record as 'survived' or 'not survived', unguided trials of numerous types of oncology therapies are very expensive, and confidentiality agreements with doctors and patients are required to obtain labeled patient records.
Proposed method These difficulties in the collection of labeled patient data have led researchers to consider semi-supervised learning (SSL), a recent machine learning approach that is also capable of utilizing unlabeled patient data, which are relatively easier to collect. SSL is therefore regarded as an algorithm that could circumvent the known difficulties. However, it remains true even for SSL that more labeled data lead to better prediction. To compensate for the lack of labeled patient data, we may consider tagging unlabeled patient data with virtual labels, that is, 'pseudo-labels,' and treating them as if they were labeled.
Results Our proposed algorithm, 'SSL Co-training', implements this concept based on SSL. SSL Co-training was tested using the surveillance, epidemiology, and end results database for breast cancer and it delivered a mean accuracy of 76% and a mean area under the curve of 0.81.

INTRODUCTION
Breast cancer is the most common type of cancer and the second leading cause of cancer deaths in women.1 2 The major clinical problem associated with breast cancer is predicting its outcome (survival or death) after the onset of therapeutically resistant disseminated disease. In many cases, clinically evident metastases have already occurred by the time the primary tumor is diagnosed. In general, treatments such as chemotherapy, hormone therapy, or a combination are considered to reduce the spread of breast cancer because they decrease distant metastases by one-third. Therefore, the ability to predict disease outcomes more accurately would allow physicians to make informed decisions about the potential necessity of adjuvant treatment. This could also lead to the development of individually tailored treatments to maximize treatment efficiency.3 4 Three predictive foci are related to cancer prognosis: the prediction of cancer susceptibility (risk assessment); the prediction of cancer recurrence (redevelopment of cancer after resolution); and the prediction of cancer survivability. In the third case, research is focused on predicting the outcome in terms of life expectancy, survivability, progression, or tumor-drug sensitivity after the diagnosis of the disease. In this study, we focused on survivability prediction, which involves the use of methods and techniques for predicting the survival of a particular patient based on historical data.5 In general, 'survival' can be defined as the patient remaining alive for a specified period after the diagnosis of the disease. If the patient is still living 1825 days (5 years) after the date of diagnosis, the patient is considered to have survived.6 Note that the prediction of survivability is mainly used for analyses in which the interest is observing the time to death of a patient, whereas we addressed it as a classification problem, that is, predicting whether the patient belonged to the group who survived after a specified period.
Research into breast cancer using data mining or machine learning methods has improved treatments, particularly less invasive predictive medicine. Cruz and Wishart7 conducted a wide-ranging investigation of different machine learning methods, discussing issues related to the types of data incorporated and the performance of these techniques in breast cancer prediction and prognosis. That review provides detailed explanations leading to first-rate research guidelines for the application of machine learning methods during cancer prognosis. Delen et al5 used two popular data mining algorithms, artificial neural networks (ANN) and decision trees, together with a common statistical method, logistic regression, to develop prediction models for breast cancer survivability; the decision tree was shown to be the best predictor. An improvement in the results of decision trees for the prognosis of breast cancer survivability is described in Khan et al,4 who propose a hybrid prognostic scheme based on weighted fuzzy decision trees. This hybrid scheme is an effective alternative to crisp classifiers that are applied independently, and it analyzes the hybridization of accuracy and interpretability by using fuzzy logic and decision trees. Thongkam et al8 conducted data preprocessing with RELIEF attribute selection and used the Modest AdaBoost algorithm to predict breast cancer survivability on the Srinagarind Hospital database; the results showed that Modest AdaBoost performed better than Real and Gentle AdaBoost. The same authors9 then proposed a hybrid scheme to generate a high-quality dataset for developing improved breast cancer survival models.
A large volume of breast cancer patient data is required to build predictive models. In the machine learning or data mining domain, the types of data are categorized as 'labeled' (feature/label pairs) or

Kim J, et al. J Am Med Inform Assoc 2013;20:613–618. doi:10.1136/amiajnl-2012-001570 613



‘unlabeled’ (features without labels). For patient data related to SEMI-SUPERVISED LEARNING
breast cancer survivability, the label tags a patient as ‘survived’ if In many real-world classification problems, the number of
they survived for a specified period or ‘not survived’ if they did class-labeled data points is small because they are often difficult,
not. Accumulating a large quantity of labeled data is time con- expensive, or time consuming to acquire and they may require
suming, costly, and it requires confidentiality agreements. In qualified human annotators, as described in Choi and Shin21
general, the collection of labeled survival data requires at least and Shin and colleagues.22 23 By contrast, unlabeled data can be
5 years.5 6 Moreover, oncologist consultation fees must be paid gathered easily and it can provide valuable information for
to confirm survivability. Furthermore, doctors and patients learning, as discussed in He et al.24 However, traditional classifi-
seldom reveal their information. Therefore, is it worth waiting cation algorithms such as supervised learning algorithms only
for 5 years to acquire the survival data, while also paying a sig- use labeled data so they encounter difficulties when only a few
nificant fee, expending considerable effort, and persuading labeled data are available. SSL uses labeled and unlabeled data
patients to disclose their personal medical data? By contrast, to improve the performance of supervised learning, as shown
unlabeled data can be collected with much less effort. Censored in He et al24 and Chapelle et al.25 In SSL, the classification

Downloaded from https://academic.oup.com/jamia/article-abstract/20/4/613/819025 by guest on 17 January 2020


data are abundant in survival analysis because in many cases the function is trained using a small set of labeled data
patient data have not been updated recently, so they remain fL ¼ fðxi ; yj Þni¼1
1
g and a large set of unlabeled data
n
unlabeled. Therefore, an economical solution may be to utilize a U ¼ fðxj Þj¼n1 þ1 g, where y=±1 indicates the labels. The total
large quantity of unlabeled data when building a predictive number of data points is n=n1=nu.26 There are several types of
model. This is achievable using semi-supervised learning (SSL) SSL algorithms, but graph-based SSL was used in our study. In
algorithms, which have recently emerged in the machine learn- graph-based SSL, a weighted graph is constructed in whiche the
ing domain. SSL is an appealing method in areas where labeled nodes represent the labeled and unlabeled data points while the
data are hard to collect. It has been used in areas such as text edges reflect the similarity between data points. According to
classification,10 text chunking,11 document clustering,12 time- Zhu,27 graph-based SSL methods are non-parametric, discrim-
series classification,13 gene expression data classification,14 15 inative, and transductive in nature. They assume label smooth-
visual classification,16 question-answering tasks for ranking can- ness over the graph. According to this assumption, if two data
didate sentences,17 and webpage classification.18 As with these points are coupled by a path of high density (eg, it is more
examples in other domains, SSL may be a good solution likely that both belong to the same group or cluster), their
because it can use censored data to modify or reprioritize sur- outputs are likely to be close, whereas their outputs need not be
vivability predictions obtained using labeled patient data alone. close if they are separated by a low-density region.25 There are
A good example of the application of SSL to the prognosis of many graph-based SSL algorithms, for example, mincut,
breast cancer survivability can be found in Shin et al,19 in which Gaussian random fields and harmonic functions, local and
the successful implementation of SSL predicted survival out- global consistency, Tikhonov regularization, manifold regulariza-
comes with reasonable accuracy and stability, thereby relieving tion, graph kernels from the Laplacian spectrum, and tree-based
oncologists of the burden of collecting labeled patient data. Bayes.17 27 There are many technical differences, but all of these
SSL is capable of utilizing unlabeled patient data, but the pre- methods use labeled nodes to set the labels y1 [ f1; þ1g,
diction accuracy of SSL increases with the amount of labeled while the unlabeled nodes are set to zero (yu=0), and the pair-
patient data, like most algorithms in machine learning. To over- wise relationships between nodes are represented using a simi-
come the aforementioned difficulties in the collection of labeled larity matrix.22 Figure 1 depicts a graph with eight data points,
patient data, it may be possible to obtain more labeled data by which are linked by the similarity between them.
generating labels for unlabeled data and treating them as if they
were labeled. These may be referred to as ‘pseudo-labeled’ data. 8 0 19
Note that labeled and unlabeled patient data are obtained dir- < ðxi  xj Þt ðxi  xj Þ =
@ if i  j A
ectly from a given dataset, whereas pseudo-labeled data are gen- wij ¼ exp  a2 ð1Þ
: ;
erated artificially by the proposed model in this paper. This is 0 otherwise
the motivation of our study. The proposed model is named as
SSL Co-training. The model is based on SSL and more than two
member models are used to generate pseudo-labels. Unlabeled
data become pseudo-labeled when agreement on labeling is
reached by the member models. This process is repeated until
no more agreement can be obtained. An increased prediction
accuracy for breast cancer survivability using labeled, unlabeled,
and pseudo-labeled patient data will allow medical oncologists
to select the most appropriate treatments for cancer patients.
The remainder of the paper is organized as follows. The next
section introduces SSL, which is the base algorithm for our pro-
posed co-training algorithm. The section entitled ‘Proposed
method: semi-supervised co-training’ explains our proposed SSL
Co-training algorithm in detail. The section on experiments pro-
vides the experimental results for a comparison of our proposed
algorithm and the latest machine learning models such as
support vector machines (SVM), ANN, and graph-based SSL.
We used the surveillance, epidemiology, and end results (SEER)
cancer incidence database, which is the most comprehensive Figure 1 Graph-based semi-supervised learning: labeled nodes are
source of information on cancer incidence and survival in the represented by ‘+1’ (survived) and ‘−1’ (not survived), whereas
USA.20 The final section presents our conclusions. unlabeled nodes are represented by ‘?’ (to be predicted).
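As an illustration of the graph construction in Eq. (1), the following is a minimal NumPy sketch (the toy data, function name, and parameter defaults are ours, not code from the paper): each point is linked to its k nearest neighbors, and the edge weight decays with the Gaussian of the squared distance, with `alpha` playing the role of α in Eq. (1).

```python
import numpy as np

def similarity_matrix(X, k=2, alpha=1.0):
    """Build the k-nearest-neighbor similarity graph of Eq. (1):
    w_ij = exp(-(x_i - x_j)^T (x_i - x_j) / alpha^2) if i ~ j, else 0."""
    n = X.shape[0]
    # squared Euclidean distances between all pairs of points
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        # indices of the k nearest neighbors of x_i (position 0 is x_i itself)
        nbrs = np.argsort(d2[i])[1:k + 1]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / alpha ** 2)
    # symmetrize so that an edge i ~ j implies j ~ i
    return np.maximum(W, W.T)

# Two well-separated clusters: edges form within clusters, not across them.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
W = similarity_matrix(X, k=2)
```

With this data, W is symmetric with a zero diagonal, nearby points such as x_1 and x_2 get a positive weight, and points in different clusters stay disconnected.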


Figure 2 The semi-supervised learning (SSL) Co-training algorithm.

The similarity between the two nodes x_i and x_j is represented by w_ij in a weight matrix W. A label can propagate from a (labeled) node to an (unlabeled) node x_j only when the value of w_ij is large. The value of w_ij can be measured using the Gaussian function in Eq. (1).25 In Eq. (1), i ∼ j indicates that an edge (link) is constructed between nodes x_i and x_j by the k nearest-neighbors algorithm, where k is a user-defined hyperparameter. The algorithm outputs an n-dimensional real-valued vector f = [f_l^T f_u^T]^T = (f_1, …, f_{n_1}, f_{n_1+1}, …, f_n)^T, which can generate a threshold value to perform the label predictions on (f_1, …, f_n) as a result of the learning. There are two assumptions: a loss condition (f_i should be close to the given label y_i for the labeled nodes) and label smoothness (overall, f_i should not be too different from f_j for neighboring nodes). These assumptions are reflected in the value of f by minimizing the following quadratic function:21 22 28 29

    min_f (f − y)^T (f − y) + μ f^T L f,    (2)

where y = (y_1, …, y_{n_1}, 0, …, 0)^T and the matrix L, known as the graph Laplacian matrix, is defined as L = D − W, where D = diag(d_i) and d_i = Σ_j w_ij. The parameter μ trades off loss and smoothness. The solution of this problem is therefore

    f = (I + μL)^(−1) y.    (3)

PROPOSED METHOD: SEMI-SUPERVISED CO-TRAINING
SSL may be a good candidate to use as a predictive model for cancer survivability, particularly when the available dataset for model learning has an abundance of unlabeled patient cases but a lack of labeled ones. Like many other machine learning algorithms, however, it performs better when more labeled data are available. A solution for obtaining more labeled data is to assign labels to unlabeled data, that is, 'pseudo-labels,' and use them for model learning as if they were labeled. The proposed model generates pseudo-labels and thereby increases the performance of SSL. The model involves multiple member models, and pseudo-labels are determined based on agreements among the members; therefore, it is named SSL Co-training. SSL Co-training is described in this section, in which we limit the number of members to two for the sake of simplicity.
The proposed algorithm is presented in figure 2. Let L and U denote the labeled and unlabeled datasets, respectively. We assume that two member models, F1 and F2, are provided (more concretely, two SSL classifiers) and that they are independent. At the start of the algorithm, each of the two classifiers is trained on L and U following the objective function in Eq. (2) as an ordinary SSL classifier. After training, both classifiers produce prediction scores for U according to Eq. (3); let us denote these score vectors as f1 and f2, respectively. The values of f1 are continuous, so discretization is required to make binary labels for U. A simple rule of setting the midpoint of f1 as the cutoff value m1 provides labels for all of the unlabeled data: y1u = +1 if f1 is larger than m1, and y1u = −1 otherwise. For the classifier F2, y2u is similarly obtained from the prediction score f2 and its midpoint m2. The labels of F1 may be concordant or in conflict with those of F2. For an unlabeled data point in U, the algorithm assigns a pseudo-label yu only when all of the members agree on the labeling, because this gives higher confidence in the newly made labels. An unlabeled data point takes the value of its pseudo-label yu from either F1 or F2 when y1u = y2u, or it remains unlabeled. The unlabeled data points that fail to obtain pseudo-labels are denoted 'boosted samples'. During the next iteration, the unlabeled data points with pseudo-labels are added to the labeled dataset L, whereas the boosted samples remain in the unlabeled dataset U. As the iteration proceeds, the size of L increases whereas that of U decreases. The iteration stops if the size of U (the number of boosted samples) stops decreasing.
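The base-learner step above reduces to a single linear solve. A minimal NumPy sketch of Eqs. (2)-(3) on a hypothetical four-node graph (our own toy example, not the authors' implementation):

```python
import numpy as np

def ssl_predict(W, y, mu=1.0):
    """Graph-based SSL, Eqs. (2)-(3): minimize (f-y)^T (f-y) + mu * f^T L f,
    whose minimizer is f = (I + mu*L)^(-1) y.
    y holds +1/-1 for labeled nodes and 0 for unlabeled nodes."""
    D = np.diag(W.sum(axis=1))   # degree matrix D = diag(d_i)
    L = D - W                    # graph Laplacian L = D - W
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) + mu * L, y)

# Toy graph: nodes 0-1 form one cluster, nodes 2-3 another; only 0 and 3 are labeled.
W = np.array([[0.0, 0.9, 0.0, 0.0],
              [0.9, 0.0, 0.1, 0.0],
              [0.0, 0.1, 0.0, 0.9],
              [0.0, 0.0, 0.9, 0.0]])
y = np.array([1.0, 0.0, 0.0, -1.0])
f = ssl_predict(W, y, mu=1.0)
```

Here the labels propagate along the strong edges: the unlabeled node 1 receives a positive score from its labeled neighbor 0, and node 2 a negative score from node 3, so thresholding f recovers the cluster structure.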

Figure 3 Patterns of (A) the number of boosted samples and (B) model performance during the iterations of semi-supervised learning (SSL) Co-training.


Figure 4 Schematic description of semi-supervised learning Co-training. In the beginning (iteration 0), the two data points x1 and x5 belong to the labeled set L={(x1,+1),(x5,−1)} and the labels are given as y1=+1 and y5=−1, respectively. x2, x3, and x4 belong to the unlabeled dataset U={x2,x3,x4}. After training (iteration 1), the predicted labels for the three data points are given by F1 and F2. For x2, the two classifiers agree on labeling, y12=y22=+1, so its pseudo-label becomes y2=+1. Likewise, x4 obtains the pseudo-label y4=−1. However, the two classifiers disagree on the labeling of x3: y13=+1 but y23=−1. Therefore, x3 is a boosted sample, according to the definition of the proposed algorithm, and it remains unlabeled. In the next iteration (iteration 2), the labeled dataset is increased by the two pseudo-labeled data points, L={(x1,+1),(x5,−1),(x2,+1),(x4,−1)}, and the unlabeled dataset is decreased to U={x3}. Similar to the previous iteration, F1 and F2 provide x3 with the predicted labels y13=+1 and y23=−1, respectively. However, they still fail to agree on the labeling of x3. The number of boosted samples is the same as in the previous iteration, so the algorithm stops.
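The agreement rule and the stopping criterion traced in figure 4 can be sketched in plain Python. The member models here are hypothetical stand-in scorers, not the paper's SSL classifiers; each member labels a point +1 if its score exceeds the midpoint cutoff, and a point becomes pseudo-labeled only when all members agree:

```python
def co_train(labeled, unlabeled, members, max_iter=10):
    """labeled: list of (x, y) pairs with y in {+1, -1}; unlabeled: list of x.
    members: scoring functions score(labeled, unlabeled) -> continuous scores.
    Returns the enlarged labeled set and the remaining boosted samples."""
    L, U = list(labeled), list(unlabeled)
    prev_boosted = None
    for _ in range(max_iter):
        # stop when U is empty or its size (boosted samples) stops decreasing
        if not U or (prev_boosted is not None and len(U) >= prev_boosted):
            break
        prev_boosted = len(U)
        votes = []
        for score in members:
            f = score(L, U)
            m = (max(f) + min(f)) / 2.0            # midpoint cutoff
            votes.append([1 if s > m else -1 for s in f])
        # unanimous points get pseudo-labels; the rest are boosted samples
        agree = [all(v[j] == votes[0][j] for v in votes) for j in range(len(U))]
        L.extend((x, votes[0][j]) for j, x in enumerate(U) if agree[j])
        U = [x for j, x in enumerate(U) if not agree[j]]
    return L, U

# Hypothetical 1-D scorers: sign of (distance to negative mean) - (distance to positive mean)
def lin_score(L, U):
    pos = [x for x, y in L if y == 1]
    neg = [x for x, y in L if y == -1]
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    return [abs(x - mn) - abs(x - mp) for x in U]

def sq_score(L, U):
    pos = [x for x, y in L if y == 1]
    neg = [x for x, y in L if y == -1]
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    return [(x - mn) ** 2 - (x - mp) ** 2 for x in U]

L_out, U_out = co_train([(0.0, 1), (10.0, -1)], [1.0, 9.0, 6.0], [lin_score, sq_score])
```

In this toy run the two members agree on all three unlabeled points, so every point is pseudo-labeled in one iteration; points on which the members disagreed would remain in U_out as boosted samples, mirroring x3 in figure 4.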

Figure 3A shows the decreasing pattern for the number of boosted samples during the iterations. Figure 3B shows the increasing pattern of the model performance due to the increasing size of the labeled data points (note that the performances of the two member classifiers also increase). The toy example shown in figure 4 is helpful for understanding the proposed algorithm.
The member composition used for SSL Co-training can be diverse. First, the number of members is not limited, so they can be multiple. Second, different member models can be built from different data sources or different model parameters. In the current study, the two member models, F1 and F2, were built by splitting a dataset into two sub-datasets. The split is conducted so that the two sub-sets are maximally uncorrelated, that is, the attributes in one set are not correlated with those in the other set.

EXPERIMENTS
Data, performance measurement, and experimental setting
The breast cancer survivability dataset (1973–2003) from SEER was used for the experiment; SEER is an initiative of the National Cancer Institute and the premier source for cancer statistics in the USA (http://www.seer.cancer.gov).20 SEER claims to have one of the most comprehensive collections of cancer statistics. It includes incidence, mortality, prevalence, survival,

Table 1 Prognostic elements related to breast cancer survivability

1. Stage: defined by the size of the cancer tumor and its spread
2. Grade: appearance of the tumor and its similarity to more or less aggressive tumors
3. Lymph node involvement: none, (1–3) minimal, (4–9) significant, etc
4. Race: ethnicity (white, black, Chinese, etc)
5. Age at diagnosis: actual age of the patient in years
6. Marital status: married, single, divorced, widowed, or separated
7. Primary site: presence of the tumor at a particular location in the body (topographical classification of cancer)
8. Tumor size: 2–5 cm; at 5 cm, the prognosis worsens
9. Site-specific surgery: information on surgery during the first course of therapy, whether cancer-directed or not
10. Radiation: none, beam radiation, radioisotopes, refused, recommended, etc
11. Histological type: form and structure of the tumor
12. Behavior code: normal or aggressive tumor behavior, defined using codes
13. No. of positive nodes examined: lymph nodes involved in the cancer are known as positive
14. No. of nodes examined: total nodes (positive/negative) examined
15. No. of primaries: number of primary tumors (1–6)
16. Clinical extension of tumor: defines the spread of the tumor relative to the breast
17. Survivability: target binary variable defining the class of survival of the patient

Figure 5 Changes during semi-supervised learning (SSL) Co-training iterations: (A) the number of boosted samples and (B) the area under the curve.

lifetime risk, and statistics by race/ethnicity. The data consist of 162 500 records with 16 predictor features and one target class variable. The 16 features are: tumor size, number of nodes, number of primaries, age at diagnosis, number of positive nodes, marital status, race, behavior code, grade, extension of tumor, node involvement, histological type according to the international classification of diseases (ICD), primary site, site-specific surgery, radiation, and stage. The target variable 'survivability' in the SEER dataset is a binary categorical feature with values '−1' (not survived) or '+1' (survived). Table 1 summarizes the features and their descriptions. The breast cancer survival dataset contains 128 469 positive cases and 34 031 negative cases. To avoid the difficulties in model learning caused by the large-sized and class-imbalanced dataset, 40 000 data points were used for the training set and 10 000 for the test set, drawn randomly without replacement. The equipoise dataset of 50 000 data points was eventually divided into 10 groups and fivefold cross validation was applied to each.
We used the accuracy and the area under the receiver operating characteristic curve (AUC) as performance measures.10 30 Accuracy is a measure of the total number of correct predictions when the value of the classification threshold is set to 0. By contrast, the AUC assesses the overall value of a classifier: it is a threshold-independent measure of model performance based on the receiver operating characteristic curve, which plots the trade-off between sensitivity and 1−specificity for all possible values of the threshold.
Four representative models, that is, ANN, SVM, SSL, and SSL Co-training, were used to perform classification for breast cancer survivability. The model parameters were searched over the following ranges for the respective models. For ANN, the number of 'hidden nodes' and the 'random seed' for the initial weights were searched over hidden-node={3, 6, 9, 12, 15} and random-seed={1, 3, 5, 7, 10}.31 For SVM, the values for the RBF kernel width 'gamma' and the loss penalty term 'C' were selected by searching the ranges C={0.2, 0.4, 0.6, 0.8, 1} and gamma={0.0001, 0.001, 0.01, 0.1, 1}.32 For the SSL and SSL Co-training models, the values for the number of neighbors 'k' and the trade-off parameter 'mu' between the smoothness condition and the loss condition in Eq. (2) were searched over k={3, 7, 15, 20, 30} and mu={0.0001, 0.01, 1, 100, 1000}, respectively.

RESULTS
SSL Co-training using each of the 10 datasets proceeded with between 3 and 5 iterations. Figure 5 shows the typical changes in the number of boosted samples and the AUC as the iterations proceeded. The number of boosted samples decreased as the iterations proceeded, as shown in figure 5A, while the AUC performance in figure 5B increased due to the enhancement of the labeled dataset with pseudo-labeled data points. Note that the increasing patterns in the AUC for the two member models F1 and F2 demonstrate the success of co-training between them, that is, F1 helps to raise the performance of F2 and vice versa.
Table 2 shows a comparison of the results with ANN, SVM, SSL, and SSL Co-training in terms of the accuracy and AUC. For each of the four models, the best performance was selected by searching over the respective model-parameter space. For the 10 datasets, the best performance among the four models is marked in bold face. In terms of accuracy, SSL Co-training delivered outstanding performance with an average accuracy of 0.76, while SSL was ranked second best. In terms of the AUC, SSL Co-training produced an average AUC of 0.81, the best of the four models, although comparable performance was delivered by SVM. Figure 6 summarizes the performance of the four models using two radar graphs.

Table 2 Performance comparison using ANN, SVM, SSL, and SSL Co-training with the 10 datasets

             Accuracy                          AUC
Dataset   ANN   SVM   SSL   SSL Co-training   ANN   SVM   SSL   SSL Co-training
1         0.66  0.52  0.72  0.77              0.68  0.79  0.77  0.84
2         0.67  0.52  0.72  0.79              0.72  0.79  0.79  0.82
3         0.62  0.50  0.70  0.76              0.68  0.80  0.78  0.78
4         0.67  0.51  0.68  0.75              0.72  0.79  0.76  0.81
5         0.64  0.52  0.71  0.77              0.66  0.82  0.78  0.82
6         0.62  0.52  0.71  0.76              0.68  0.78  0.77  0.83
7         0.63  0.51  0.69  0.77              0.67  0.79  0.77  0.83
8         0.69  0.51  0.73  0.76              0.73  0.82  0.80  0.82
9         0.66  0.52  0.70  0.74              0.71  0.81  0.78  0.78
10        0.64  0.51  0.73  0.77              0.73  0.81  0.80  0.81
Average   0.65  0.51  0.70  0.76              0.70  0.80  0.78  0.81

ANN, artificial neural network; AUC, area under the curve; SSL, semi-supervised learning; SVM, support vector machine. Bold numbers represent the best performance among the four models.

CONCLUSION
To predict cancer survivability, the acquisition of more patient data with labels of either 'survived' or 'not survived' is an important issue because better predictive models can be produced based on them. In practice, however, there are many obstacles when collecting patient labels because of limitations of time and cost and because of confidentiality conflicts. Therefore, researchers have been attracted to predictive models that can also utilize unlabeled patient data, which are relatively more abundant. SSL has thus been highlighted as a promising candidate. However, the tenet that 'the more labeled data, the better the prediction' still applies to SSL because it is a learning algorithm guided by information contained in the labeled data, like other machine learning algorithms.


Figure 6 Performance comparison using artificial neural networks (ANN), support vector machines (SVM), semi-supervised learning (SSL), and SSL Co-training: (A) accuracy and (B) area under the curve (AUC).

To compensate for the lack of labeled data, therefore, SSL Co-training was proposed in this paper. Our proposed algorithm generates pseudo-labels by co-training multiple SSL member models, which assign them to unlabeled data before treating them as if they were labeled. As the process iterates, the labeled data increase and the predictive performance of SSL also increases. An empirical validation of SSL Co-training using the SEER breast cancer database demonstrated its superior performance compared with representative machine learning algorithms such as ANN, SVM, and ordinary SSL. Using pseudo-labeled patient data, as well as labeled and unlabeled patient data, will improve the technical quality of the prognosis of cancer survivability, which is expected to lead to better treatment for cancer patients.
Our proposed SSL Co-training approach remains at a nascent stage, so further studies should be carried out in the near future. The composition of the member models for co-training will be addressed in future research, that is, we need to determine the optimum member size and how to make the members sufficiently diverse. More sophisticated methods are also required in the pseudo-labeling process, that is, we need to set the cutoff value so as to improve the confidence of the labeling.

Acknowledgements The authors gratefully acknowledge the support from Post Brain Korea 21 and a research grant from the National Research Foundation of the Korean government (2012-0000994/2010-0007804).

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.

REFERENCES
1 American Cancer Society. Cancer facts & figures 2010. Atlanta: American Cancer Society, 2010.
2 National Cancer Institute. Breast cancer statistics, USA, 2010. http://www.cancer.gov/cancertopics/types/breast (accessed 11 Jul 2011).
3 Sun Y, Goodison S, Li J, et al. Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics 2007;23:30–7.
4 Khan U, Shin H, Choi JP, et al. wFDT—weighted fuzzy decision trees for prognosis of breast cancer survivability. In: Roddick JF, Li J, Christen P, Kennedy PJ, eds. Proceedings of the Seventh Australasian Data Mining Conference. Glenelg, South Australia, 2008:141–52.
5 Delen D, Walker G, Kadam A. Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med 2005;34:113–27.
6 Brenner H, Gefeller O, Hakulinen T. A computer program for period analysis of cancer patient survival. Eur J Cancer 2002;38:690–5.
7 Cruz JA, Wishart DS. Applications of machine learning in cancer prediction and prognosis. Cancer Inform 2006;2:59–78.
8 Thongkam J, Xu G, Zhang Y, et al. Breast cancer survivability via AdaBoost algorithms. In: Warren JR, Yu P, Yearwood J, Patrick JD, eds. Proceedings of the Second Australasian Workshop on Health Data and Knowledge Management. Wollongong, NSW, Australia, 2008:55–64.
9 Thongkam J, Xu G, Zhang Y, et al. Towards breast cancer survivability prediction models through improving training space. Expert Syst Appl 2009;36:12200–9.
10 Subramanya A, Bilmes J. Soft-supervised learning for text classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Honolulu, Hawaii, 2008:1090–9.
11 Ando RK, Zhang T. A high-performance semi-supervised learning method for text chunking. In: Knight K, Ng HT, Oflazer K, eds. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Ann Arbor, Michigan, 2005:1–9.
12 Zhong S. Semi-supervised model-based document clustering: a comparative study. Mach Learn 2006;65:3–29.
13 Wei L, Keogh E. Semi-supervised time series classification. In: Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining (KDD 2006). Philadelphia, PA, USA, 2006:748–53.
14 Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol 2004;2:0511–22.
15 Gong YC, Chen CL. Semi-supervised method for gene expression data classification with Gaussian fields and harmonic functions. In: Proceedings of the 19th International Conference on Pattern Recognition. Tampa, FL, 2008:1–4.
16 Morsillo N, Pal C, Nelson R. Semi-supervised learning of visual classifiers from web images and text. In: Boutilier C, ed. Proceedings of the 21st International Joint Conference on Artificial Intelligence. Pasadena, California, USA, 2009:1169–74.
17 Celikyilmaz A, Thint M, Huang Z. A graph-based semi-supervised learning for question-answering. In: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics. Singapore, 2009:719–27.
18 Liu R, Zhou J, Liu M. Graph-based semi-supervised learning algorithm for web page classification. In: Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications. China: IEEE Computer Society, 2006:856–60.
19 Shin H, Kim D, Park K, et al. Breast cancer survivability prediction with the surveillance, epidemiology, and end results database. Seoul, Korea: TBC, 2011.
20 SEER. Surveillance, Epidemiology, and End Results Program, National Cancer Institute. 2010. http://www.seer.cancer.gov (accessed 11 Jul 2011).
21 Choi I, Shin H. Semi-supervised learning with ensemble learning and graph sharpening. In: Colin F, Kim DS, Lee SY, eds. Proceedings of the 9th International Conference on Intelligent Data Engineering and Automated Learning. Daejeon, South Korea, 2008:172–9.
22 Shin H, Hill NJ, Lisewski AM, et al. Graph sharpening. Expert Syst Appl 2010;37:7870–9.
23 Shin H, Lisewski AM, Lichtarge O. Graph sharpening plus graph integration: a synergy that improves protein functional classification. Bioinformatics 2007;23:3217–24.
24 He J, Carbonell J, Liu Y. Graph-based semi-supervised learning as a generative model. In: Veloso MM, ed. Proceedings of the 20th International Joint Conference on Artificial Intelligence. Hyderabad, India, 2007:2492–7.
25 Chapelle O, Schölkopf B, Zien A. Semi-supervised learning. Cambridge, MA: The MIT Press, 2006:3–14.
26 Wang J. Efficient large margin semi-supervised learning. J Mach Learn Res 2007;10:719–42.
27 Zhu X. Semi-supervised learning literature survey. Computer Sciences TR 1530. Madison: University of Wisconsin, 2008.
28 Belkin M, Matveeva I, Niyogi P. Regularization and semi-supervised learning on large graphs. In: Lecture Notes in Computer Science, vol 3120. Springer, 2004:624–38.
29 Chapelle O, Weston J, Schölkopf B. Cluster kernels for semi-supervised learning. In: Advances in Neural Information Processing Systems. Cambridge, MA: The MIT Press, 2003:585–92.
30 Allouche O, Tsoar A, Kadmon R. Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). J Appl Ecol 2006;43:1223–32.
31 Abraham A. Artificial neural networks. In: Sydenham P, Thorn R, eds. Handbook of Measuring System Design. London: John Wiley & Sons, 2005.
32 Shin H, Cho S. Neighborhood property-based pattern selection for support vector machines. Neural Comput 2007;19:816–55.

