Project Report: Self-Labeled Techniques For Semi-Supervised Learning
Mark Laane
1 Background
2 Base classifiers
3.2 Tri-Training
The implementation of Tri-Training is based on the description of the algorithm in [4]. In Tri-Training, three base classifiers are first trained on randomly subsampled sets of the labeled data. Each of them is then iteratively retrained, also taking into account the labels predicted by the other two base classifiers. These predictions inherently introduce noise by mislabeling some samples. To set an upper bound on the classification noise rate during the training process, a restriction is put in place: the chosen base classifier is only retrained if the following equation holds
$0 < \frac{e_t}{e_{t-1}} < \frac{|L_{t-1}|}{|L_t|} < 1$  (1)
where $|L_t|$ is the number of samples that can be added in the t-th round (those on which the other two base classifiers agree) and $e_t$ is the upper bound of the classification error rate of those labeled samples $L_t$ in the t-th round. If $|L_t|$ is too large to satisfy equation 1, the number of samples can be reduced by randomly subsampling the set so that it contains $s$ samples, where $s = \lceil \frac{e_{t-1}|L_{t-1}|}{e_t} - 1 \rceil$. The algorithm continues as long as there is at least one $C_i$ for which equation 1 can be satisfied.
The algorithm can be summarized as follows:
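As a rough illustration of the training loop described above, the following Python sketch assumes scikit-learn-compatible base classifiers and NumPy arrays for the data; the function name tri_training and its parameters are illustrative and not taken from the actual implementation.

import math
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample

def tri_training(base_estimator, X_l, y_l, X_u):
    # Train three classifiers on bootstrap samples of the labeled data.
    clfs = [clone(base_estimator).fit(*resample(X_l, y_l)) for _ in range(3)]
    e_prev = [0.5, 0.5, 0.5]  # error upper bounds e_{t-1} from the previous round
    l_prev = [0, 0, 0]        # sizes |L_{t-1}| from the previous round

    improved = True
    while improved:           # stop once no C_i can satisfy equation (1)
        improved = False
        for i in range(3):
            j, k = [m for m in range(3) if m != i]

            # e_t: error rate of the other two classifiers where they agree,
            # estimated on the original labeled set.
            pj, pk = clfs[j].predict(X_l), clfs[k].predict(X_l)
            agree = pj == pk
            e_t = np.mean(pj[agree] != y_l[agree]) if agree.any() else 1.0

            # L_t: unlabeled samples on which the other two classifiers agree.
            uj, uk = clfs[j].predict(X_u), clfs[k].predict(X_u)
            mask = uj == uk
            X_t, y_t = X_u[mask], uj[mask]

            if not (0 < e_t < e_prev[i]) or len(X_t) == 0:
                continue
            if l_prev[i] == 0:
                # First round: smallest |L_{t-1}| that allows equation (1) to hold.
                l_prev[i] = math.floor(e_t / (e_prev[i] - e_t) + 1)
            if len(X_t) <= l_prev[i]:
                continue
            if e_t * len(X_t) >= e_prev[i] * l_prev[i]:
                # |L_t| is too large to satisfy equation (1): subsample to s samples.
                if l_prev[i] <= e_t / (e_prev[i] - e_t):
                    continue
                s = math.ceil(e_prev[i] * l_prev[i] / e_t - 1)
                idx = np.random.choice(len(X_t), size=s, replace=False)
                X_t, y_t = X_t[idx], y_t[idx]

            # Equation (1) holds: retrain C_i on the labeled data plus L_t.
            clfs[i] = clone(base_estimator).fit(
                np.vstack([X_l, X_t]), np.concatenate([y_l, y_t]))
            e_prev[i], l_prev[i] = e_t, len(X_t)
            improved = True

    return clfs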
The three trained classifiers can then be used for classification via majority voting.
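For prediction, a simple per-sample majority vote over the three classifiers could look like the sketch below; predict_majority is a hypothetical helper name and not part of the reported implementation.

import numpy as np

def predict_majority(clfs, X):
    # Stack the three classifiers' predictions and return the per-sample majority label.
    preds = np.stack([clf.predict(X) for clf in clfs])  # shape: (3, n_samples)
    voted = []
    for column in preds.T:
        labels, counts = np.unique(column, return_counts=True)
        voted.append(labels[counts.argmax()])
    return np.array(voted)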
4 Experiments
5 Results
Each self-labeled algorithm was tested with each base classifier, and the validation was performed on each dataset. The results can be seen in the table in Appendix 2, and a selection of confusion matrices in Appendix 1.
As expected, raising the labelling rate also raises the transductive and test accuracy in most cases. The Abalone dataset proves more difficult to classify, and extra labels do not always improve the results. There is also an anomalous result of unknown origin, where Tri-Training used with Naïve Bayes has its highest accuracy at a 10% labelling ratio. Both algorithms perform worst with Naïve Bayes as the base classifier and best with the CART base classifier. Triguero et al. also reported good performance with C4.5, a similar decision tree algorithm. On the dermatology dataset, both Self-Training with CART and Tri-Training with CART achieved the best results, approaching 90% accuracy.
Generally, the classification accuracy achieved by the implemented algorithms is lower than the results reported by Triguero et al., indicating inferior performance of the implementations. The high standard deviation also indicates poor performance, as the results seem to depend strongly on the fold on which the data is tested.
6 Conclusions
References