Breast Cancer (2013)
‘unlabeled’ (features without labels). For patient data related to breast cancer survivability, the label tags a patient as ‘survived’ if they survived for a specified period or ‘not survived’ if they did not. Accumulating a large quantity of labeled data is time consuming, costly, and it requires confidentiality agreements. In general, the collection of labeled survival data requires at least 5 years.5 6 Moreover, oncologist consultation fees must be paid to confirm survivability. Furthermore, doctors and patients seldom reveal their information. Therefore, is it worth waiting for 5 years to acquire the survival data, while also paying a significant fee, expending considerable effort, and persuading patients to disclose their personal medical data? By contrast, unlabeled data can be collected with much less effort. Censored

SEMI-SUPERVISED LEARNING
In many real-world classification problems, the number of class-labeled data points is small because they are often difficult, expensive, or time consuming to acquire and they may require qualified human annotators, as described in Choi and Shin21 and Shin and colleagues.22 23 By contrast, unlabeled data can be gathered easily and it can provide valuable information for learning, as discussed in He et al.24 However, traditional classification algorithms such as supervised learning algorithms only use labeled data so they encounter difficulties when only a few labeled data are available. SSL uses labeled and unlabeled data to improve the performance of supervised learning, as shown in He et al24 and Chapelle et al.25 In SSL, the classification
decreasing pattern for the number of boosted samples during the iterations. Figure 3B shows the increasing pattern of the model performance due to the increasing size of the labeled data points (note that the performances of the two member classifiers also increase). The toy example shown in figure 4 is helpful for understanding the proposed algorithm.

The member composition used for SSL Co-training can be diverse. First, the number of members is not limited, so more than two can be used. Second, different member models can be built from different data sources or different model parameters. In the current study, the two member models, F1 and F2, were built by splitting a dataset into two sub-datasets. The split is conducted so the two sub-sets are maximally uncorrelated, that is, the attributes in one set are not correlated with those in the other set.

EXPERIMENTS
Data, performance measurement, and experimental setting
The breast cancer survivability dataset (1973–2003) from SEER was used for the experiment. SEER is an initiative of the National Cancer Institute and the premier source for cancer statistics in the USA. SEER claims to have one of the most comprehensive collections of cancer statistics. It includes incidence, mortality, prevalence, survival,
No  Attribute                      Description
1   Stage                          Defined by size of cancer tumor and its spread
2   Grade                          Appearance of tumor and its similarity to more or less aggressive tumors
3   Lymph node                     None, (1–3) minimal, (4–9) significant, etc
4   Race                           Ethnicity: white, black, Chinese, etc
5   Age at diagnosis               Actual age of patient in years
6   Marital status                 Married, single, divorced, widowed, separated
7   Primary site                   Presence of tumor at particular location in body. Topographical classification of cancer.
8   Tumor size                     2–5 cm; at 5 cm, the prognosis worsens
9   Site-specific surgery          Information on surgery during first course of therapy, whether cancer-directed or not
10  Radiation                      None, beam radiation, radioisotopes, refused, recommended, etc.
11  Histological type              Form and structure of tumor
12  Behavior code                  Normal or aggressive tumor behavior is defined using
13  No of positive nodes examined  When lymph nodes are involved in cancer, they are known as positive.
14  No of nodes examined           Total nodes (positive/negative) examined
15  No of primaries                No of primary tumors (1–6)
16  Clinical extension of          Defines the spread of the tumor relative to the breast
17  Survivability                  Target binary variable defines class of survival of patient.
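
The split of the attributes into two maximally uncorrelated sub-sets, used to build the member models F1 and F2, could be sketched as follows. This section does not state how the split was searched, so the code is only an illustration under assumptions: the function name split_uncorrelated, the use of absolute Pearson correlation on numerically encoded attributes, the equal-sized halves, and the exhaustive search are all hypothetical choices, and the toy data are synthetic rather than SEER records.

```python
from itertools import combinations

import numpy as np


def split_uncorrelated(X):
    """Split feature indices into two halves with minimal cross-correlation.

    Assumption: "maximally uncorrelated" is read here as minimizing the sum
    of absolute Pearson correlations between attributes in different halves.
    Exhaustive search is feasible for about 16 attributes (6435 candidates).
    Columns must be numeric and non-constant.
    """
    d = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))  # d x d absolute correlations
    best_cost, best_split = np.inf, None
    # Keep feature 0 in the first half so each bipartition is visited once.
    for rest in combinations(range(1, d), d // 2 - 1):
        first = (0,) + rest
        second = tuple(j for j in range(d) if j not in first)
        # Total correlation between the two candidate sub-sets.
        cost = corr[np.ix_(first, second)].sum()
        if cost < best_cost:
            best_cost, best_split = cost, (first, second)
    return best_split, best_cost


# Synthetic toy data: columns 0 and 2 are near-copies, as are columns 1 and 3.
rng = np.random.default_rng(0)
a, b = rng.normal(size=500), rng.normal(size=500)
X = np.column_stack([a, b,
                     a + 0.1 * rng.normal(size=500),
                     b + 0.1 * rng.normal(size=500)])
split, cost = split_uncorrelated(X)
print(split, round(cost, 3))
```

In this toy case the strongly correlated columns end up in the same half, so each member model would see a view that carries little information about the other. For larger attribute sets a greedy or clustering-based heuristic would likely be needed in place of the exhaustive search.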
lifetime risk, and statistics by race/ethnicity. The data consists of

cancer survivability. The model parameters were searched over