Peerj Cs 275
ABSTRACT
Background. A conformational B-cell epitope is one of the main components of
vaccine design. It contains separate segments in its sequence, which are spatially close
in the antigen chain. The availability of Ag-Ab complex data on the Protein Data Bank
allows for the development of predictive methods. Several epitope prediction models,
including learning-based methods, have also been developed. However, the performance of
these models is still not optimal. The main problem in learning-based prediction models
is class imbalance.
Methods. This study proposes CluSMOTE, which is a combination of a cluster-based
undersampling method and the Synthetic Minority Oversampling Technique (SMOTE). The
approach is used to generate additional sample data to ensure that the dataset of the
conformational epitope is balanced. The Hierarchical DBSCAN algorithm is used
to identify the clusters in the majority class. Randomly selected data are
taken from each cluster, considering the oversampling degree, and combined with the
minority class data. The balanced data are used as the training dataset to develop a
conformational epitope prediction model. Furthermore, two binary classification methods,
Support Vector Machine and Decision Tree, are separately used to develop the prediction
model and to evaluate the performance of CluSMOTE in predicting conformational
B-cell epitopes. The experiment is focused on determining the best parameter for optimal
CluSMOTE. Two independent datasets are used to compare the proposed prediction
model with state of the art methods. The first and the second datasets represent the
general protein and the glycoprotein antigens respectively.
Result. The experimental result shows that CluSMOTE Decision Tree outperformed the
Support Vector Machine in terms of AUC and Gmean as performance measurements.
The mean AUC of CluSMOTE Decision Tree in the Kringelum and the SEPPA 3 test
sets are 0.83 and 0.766, respectively. This shows that CluSMOTE Decision Tree is better
than other methods in the general protein antigen, though comparable with SEPPA 3
in the glycoprotein antigen.
Subjects Bioinformatics, Data Mining and Machine Learning
Keywords Cluster-based undersampling, SMOTE, Class imbalance, Hybrid sampling, Hierarchical DBSCAN, Vaccine design
Submitted 12 May 2019
Accepted 15 April 2020
Published 1 June 2020
Corresponding author Binti Solihah, binti.solihah@mail.ugm.ac.id
Academic editor Sebastian Ventura
Additional Information and Declarations can be found on page 12
DOI 10.7717/peerj-cs.275
Copyright 2020 Solihah et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS
How to cite this article Solihah B, Azhari A, Musdholifah A. 2020. Enhancement of conformational B-cell epitope prediction using CluSMOTE. PeerJ Comput. Sci. 6:e275 http://doi.org/10.7717/peerj-cs.275
INTRODUCTION
A B-cell epitope is among the main components of peptide-based vaccines (Andersen,
Nielsen & Lund, 2006; Zhang et al., 2011; Ren et al., 2015). It can be utilized in
immunodetection or immunotherapy to induce an immune response (Rubinstein et
al., 2008). Many B-cell epitopes are conformational and originate from separate segments
of an antigen sequence, forming a spatial neighborhood in the antigen-antibody (Ag–Ab)
complex. Identifying epitopes through experiments is tedious and expensive, and it
carries a high risk of failure. Current progress in bioinformatics makes it possible
to create vaccine designs through 3D visualization of protein antigen. Many characteristics,
including composition, cooperativeness, hydrophobicity, and secondary structure, are
considered in identifying potential substances for an epitope (Kringelum et al., 2013). Since
no dominant characteristic helps experts to easily distinguish epitopes from other parts of
the antigen, the risk of failure is quite high.
The availability of the 3D structure of the Ag–Ab complex in the public domain
and computational resources eases the development of predictive models using various
methods, including the structure-based and sequence-based approaches. However,
conformational epitope prediction is still challenging. The structure-based approach
can be divided into three categories: dominant-characteristic-based, graph-based, and
learning-based.
There are several characteristic-based approaches, including (1) CEP, which uses
solvent-accessibility properties; (2) Discotope, which uses both solvent-accessibility-based
properties and the epitope log odds ratio of amino acids; (3) PEPITO, which adds half-sphere
exposure (HSE) to the log odds ratio of amino acids used in Discotope; (4) Discotope 2.0,
an improved version of Discotope that defines the log odds ratios in spatial contexts and
adds HSE as a feature; and (5) SEPPA, which utilizes exposed and adjacent residue
characteristics to form a triangle unit patch (Kulkarni-Kale, Bhosle &
Kolaskar, 2005; Andersen, Nielsen & Lund, 2006; Kringelum et al., 2012; Sun et al., 2009).
The dominant-characteristic-based approach is limited by the number of features and the
linear relationships between them.
The graph-based method is another important approach, although only two methods, both
by Zhao et al., were found during the literature review. Zhao et al. (2012) developed a
subgraph that could represent the planar nature of the epitope. Although the model is
designed to identify a single epitope, it can also detect multiple epitopes. Zhao et al. (2014)
used features extracted from both antigens and the Ag–Ab interaction, which is expressed
by a coupling graph and later transformed into a general graph.
The learning-based approach utilizes machine-learning to work with a large number
of features. It also uses nonlinear relationships between features to optimize model
performance. Rubinstein, Mayrose & Pupko (2009) used two Naïve Bayesian classifiers to
develop structure-based and sequence-based approaches. SEPPA 2.0 combines amino
acid index (AAindex) characteristics in the SEPPA algorithm in the calculation of cluster
coefficients (Qi et al., 2014; Kawashima et al., 2008). AAindex in SEPPA 2.0 is consolidated
via Artificial Neural Networks (ANN). SEPPA 3.0 adds the glycosylation triangles
and glycosylation-related AAindex to SEPPA 2.0 (Zhou et al., 2019). The glycosylation-related
AAindex is consolidated into SEPPA 3.0 via ANN. Several researchers utilized the advantages
of random forest (Dalkas & Rooman, 2017; Jespersen et al., 2017; Ren et al., 2014; Zhang et
al., 2011). The main challenge in developing a conformational B-cell epitope prediction
Data preprocessing
The creation of feature vectors and epitope annotations for the training and testing data
is conducted on surface residues only. Relative accessible surface area (RSA) is used as a
parameter to distinguish surface and buried residues. Previous studies used different
threshold values, including 0.05, 0.01, 0.1, and 0.25 (Rubinstein, Mayrose & Pupko, 2009;
Zhang et al., 2011; Kringelum et al., 2012; Ren et al., 2014; Dalkas & Rooman, 2017). This
variation affects the imbalance ratio between the epitope and non-epitope classes.
Although the standard threshold between buried and non-buried residues is 0.05, this
study uses 0.01 as the limit, because the larger the surface exposure threshold, the lower the
predictive performance (Basu, Bhattacharyya & Banerjee, 2011; Kringelum et al., 2012).
Choosing 0.01 as the limit is consistent with the finding of Zheng et al. (2015) that all RSA
values of epitopes are positive, even if only slightly larger than zero.
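The effect of the RSA limit on the class ratio can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the `Residue` record, the toy RSA values, and the use of a strict `>` comparison at the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

RSA_THRESHOLD = 0.01  # limit used in this study


@dataclass
class Residue:
    """Hypothetical record holding a precomputed RSA value and an epitope label."""
    name: str
    rsa: float
    is_epitope: bool


def surface_residues(residues, threshold=RSA_THRESHOLD):
    """Keep residues whose RSA exceeds the threshold (strict '>' is an assumption)."""
    return [r for r in residues if r.rsa > threshold]


def imbalance_ratio(residues):
    """Return the number of non-epitope residues per epitope residue."""
    positives = sum(1 for r in residues if r.is_epitope)
    if positives == 0:
        raise ValueError("no epitope residues in the selection")
    return (len(residues) - positives) / positives


# Toy chain with made-up values, for illustration only.
toy_chain = [
    Residue("A1", 0.000, False),
    Residue("A2", 0.030, True),
    Residue("A3", 0.020, False),
    Residue("A4", 0.300, False),
    Residue("A5", 0.500, True),
    Residue("A6", 0.150, False),
    Residue("A7", 0.005, False),
]

ratio_at_001 = imbalance_ratio(surface_residues(toy_chain, 0.01))
ratio_at_025 = imbalance_ratio(surface_residues(toy_chain, 0.25))
```

On this toy chain the two thresholds keep different residue sets and therefore give different class ratios, which is the dependence described in the paragraph above.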
The feature vectors used include accessible surface area (ASA), RSA, depth index (DI),
protrusion index (PI), contact number (CN), HSE, quadrant sphere exposure (QSE),
AAindex, B factor, and log odds ratio, as shown in Table 1.
ASA and RSA are the key features in determining if a residue is likely to bind to other
molecules for accessibility reasons. Although several programs can be used to calculate
ASA, the most commonly used include NACCESS and DSSP (Hubbard & Thornton, 1993;
Kabsch & Sander, 1983). DSSP only calculates the total ASA per residue, while NACCESS
computes the backbone, side chain, polar, and nonpolar ASA. However, NACCESS can
only process one molecular structure at a time, so users need to write additional scripts
to process several molecular structures at once (Mihel et al., 2008). This study used the
PSAIA application (Mihel et al., 2008). PSAIA is not limited to processing
one molecular structure at a time and can also be used to calculate other features, including RSA, PI,
and DI. No significant difference is observed between the ASA calculation results using
NACCESS and PSAIA. The ASA attribute values used include the backbone, side chain,
polar (including oxygen, nitrogen, and phosphorus), and nonpolar atoms (carbon atoms).
RSA is the ratio of the ASA value to the maximum ASA value calculated based on the GXG
tripeptide theory, where G is glycine and X is the residue in question (Lee & Richards, 1971).
[Figure: workflow diagram. Recoverable labels: 3D structure of antigen complex; data preprocessing (feature extraction); CluSMOTE; classification model; epitope candidate.]
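The RSA definition in the paragraph above can be written compactly as follows. The symbol names are introduced here for illustration and are not taken from the paper: ASA(X) is the accessible surface area of residue X in the structure, and ASA_max(X) is the reference value for X in the G-X-G tripeptide.

```latex
\mathrm{RSA}(X) = \frac{\mathrm{ASA}(X)}{\mathrm{ASA}_{\max}(X)},
\qquad
\mathrm{ASA}_{\max}(X) = \text{ASA of } X \text{ in the tripeptide G--X--G}
```

Under this definition, a residue is treated as a surface residue in this study when its RSA is above the 0.01 limit.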
Classification algorithm
Two classification algorithms, SVM and DT, were used to evaluate the performance of
CluSMOTE. SVM is a popular learning algorithm used in previous studies of
conformational epitope prediction. DT is often used to handle the class imbalance problem
and is classified as one of the top 10 data mining algorithms (Galar et al., 2012).
This study uses the JSAT (Raff, 2017) software package, utilizing the Pegasos SVM with
a mini-batch linear kernel (Shalev-Shwartz, Singer & Srebro, 2007). Pegasos SVM works
fast since the primal update process is carried out directly, and no support vector is stored.
The default values used for the epoch, regularization, and batch size parameters are
5, 1e−4, and 1, respectively. The decision tree is formed by nodes that are built on the
principle of the decision stump (Iba & Langley, 1992). Also, the study used bottom-up
pessimistic pruning with error-based pruning from the C4.5 algorithm (Quinlan, 1993).
The proportion of the data set used for pruning is 0.1.
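To illustrate why Pegasos needs no stored support vectors, the primal update can be sketched in a few lines of Python. This is a minimal sketch and not the JSAT implementation used in the study: it assumes a batch size of 1, labels in {-1, +1}, no bias term, and no projection step. The function names and the toy data are hypothetical.

```python
import random


def pegasos_train(X, y, lam=1e-4, epochs=5, seed=0):
    """Train a linear SVM with single-sample Pegasos updates.

    X is a list of feature lists and y a list of labels in {-1, +1}.
    Only the weight vector w is kept; no support vectors are stored.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # step size 1 / (lambda * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink step coming from the regularization term.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss step, applied only when the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def pegasos_predict(w, x):
    """Return +1 or -1 from the sign of the linear score."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Linearly separable toy data, for illustration only.
X_toy = [[2, 2], [3, 1], [1, 3], [2, 3], [-2, -2], [-3, -1], [-1, -3], [-2, -3]]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]
w_toy = pegasos_train(X_toy, y_toy)
```

The defaults `lam=1e-4` and `epochs=5` mirror the parameter values quoted above. Everything else is a simplification.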
No  Resampling method   r  Classifier  TPR (recall)  TNR     Precision (PPV)  FPR     AUC     Gmean   Adjusted Gmean  F-measure
1   Cluster-based only  1  DT          0.855a        0.769   0.454            0.231a  0.812   0.806   0.791           0.581
2   CluSMOTE            2  DT          0.797         0.834   0.526            0.163   0.815a  0.811a  0.823           0.622
3   CluSMOTE            3  DT          0.764         0.862   0.558            0.138   0.813   0.807   0.833           0.634
4   CluSMOTE            4  DT          0.730         0.881   0.575            0.119   0.806   0.796   0.835           0.631
5   CluSMOTE            5  DT          0.724         0.880   0.591            0.120   0.802   0.794   0.834           0.641
6   SMOTE only          –  DT          0.644         0.939a  0.732a           0.061   0.791   0.771   0.848a          0.675a
7   No Resampling       –  DT          0.637         0.939   0.730            0.061   0.788   0.767   0.846           0.669
8   Cluster-based only  1  SVM         0.591b        0.668   0.393            0.328b  0.629   0.579   0.620           0.388
9   CluSMOTE            2  SVM         0.577         0.746   0.441            0.254   0.661b  0.60b   0.666           0.400
10  CluSMOTE            3  SVM         0.498         0.790   0.486            0.210   0.644   0.580   0.675           0.396
11  CluSMOTE            4  SVM         0.475         0.801   0.508            0.199   0.638   0.566   0.672           0.387
12  CluSMOTE            5  SVM         0.468         0.819   0.529            0.178   0.643   0.572   0.683           0.401b
13  SMOTE only          –  SVM         0.384         0.881b  0.606b           0.119   0.632   0.532   0.688           0.368
14  No Resampling       –  SVM         0.409         0.874   0.569            0.126   0.641   0.557   0.699b          0.392
Notes.
TPR, True Positive Rate; TNR, True Negative Rate; AUC, Area Under ROC Curve; Gmean, Geometric mean.
a The best parameter value in DT model.
b The best parameter value in SVM model.
shows opposing conditions between the TPR and TNR. From Table 2, the best performance
using AUC and Gmean is fairer compared to AGm and F-score. At the best AUC and Gmean, a
balanced proportion was obtained between the TPR and TNR. The best AGm and F-score
resulted from the lowest TPR value. Generally, the models built with DT
exhibit better performance than those built with SVM. The performance of SVM is likely to
be affected by kernel selection problems. Linear kernels cannot separate the classes in
polynomial cases. Other configurations or models may be explored in future work.
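The threshold-based measures discussed above follow from the four confusion-matrix counts, as the following Python sketch shows. It is a minimal illustration and not the evaluation code of the study. The function name and the example counts are hypothetical, and the single-point AUC formula (TPR + TNR) / 2 is an assumption about how AUC was obtained, although it agrees with the tabulated AUC values to within rounding. The adjusted Gmean is omitted.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def imbalance_metrics(tp, fn, tn, fp):
    """Compute class-imbalance measures from confusion-matrix counts."""
    tpr = _ratio(tp, tp + fn)        # recall, sensitivity
    tnr = _ratio(tn, tn + fp)        # specificity
    precision = _ratio(tp, tp + fp)  # positive predictive value
    fpr = _ratio(fp, fp + tn)
    return {
        "TPR": tpr,
        "TNR": tnr,
        "Precision": precision,
        "FPR": fpr,
        # Single operating point AUC (assumption, see the lead-in).
        "AUC": (tpr + tnr) / 2.0,
        "Gmean": math.sqrt(tpr * tnr),
        "Fmeasure": _ratio(2.0 * precision * tpr, precision + tpr),
    }


# Made-up counts: 10 epitope and 90 non-epitope residues.
example = imbalance_metrics(tp=8, fn=2, tn=72, fp=18)
```

In this example TPR and TNR are both 0.8, so AUC and Gmean are also 0.8, while the F-measure is much lower because precision suffers from the many false positives in the majority class. This is why a high F-score can coincide with a low TPR.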
CONCLUSIONS
Because an epitope is only a small part of the exposed antigen, learning-based
conformational epitope prediction suffers from class imbalance. In this study, the CluSMOTE
method was proposed to overcome the class imbalance problem in the prediction of the
conformational epitope. The study shows that CluSMOTE considerably increases the TPR
compared to SMOTE only. The comparison of the proposed model with state-of-the-art
methods in the two datasets shows that CluSMOTE DT is comparable to or better than
other methods. Its mean AUC values in Kringelum and the SEPPA 3.0 test sets are 0.83
and 0.766, respectively. This result shows that CluSMOTE DT is better than other methods
in classifying the general protein antigen, though it is comparable to SEPPA 3.0 in the
glycoprotein antigen.
ACKNOWLEDGEMENTS
The authors thank the Publishing and Publication Agency of Universitas Gadjah Mada for
the English proof-reading of this manuscript.
Funding
This work was supported by Universitas Trisakti (Doctoral scholarship). The funders had
no role in study design, data collection and analysis, decision to publish, or preparation of
the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
Universitas Trisakti (Doctoral scholarship).
Competing Interests
The authors declare there are no competing interests.
Author Contributions
• Binti Solihah conceived and designed the experiments, performed the experiments,
analyzed the data, performed the computation work, prepared figures and/or tables,
authored or reviewed drafts of the paper, and approved the final draft.
• Azhari Azhari and Aina Musdholifah conceived and designed the experiments, authored
or reviewed drafts of the paper, and approved the final draft.
Supplemental Information
Supplemental information for this article can be found online at
http://dx.doi.org/10.7717/peerj-cs.275#supplemental-information.
REFERENCES
Andersen PH, Nielsen M, Lund OLE. 2006. Prediction of residues in discontinuous
B-cell epitopes using protein 3D structures. Protein Science 15:2558–2567
DOI 10.1110/ps.062405906.2558.
Ansari HR, Raghava GPS. 2010. Identification of conformational B-cell Epitopes
in an antigen from its primary sequence. Immunome Research 6(1):1–26
DOI 10.1186/1745-7580-6-6.
Basu S, Bhattacharyya D, Banerjee R. 2011. Mapping the distribution of packing
topologies within protein interiors shows predominant preference for specific
packing motifs. BMC Bioinformatics 12(195):1–26 DOI 10.1186/1471-2105-12-195.
Batuwita R, Palade V. 2009. A new performance measure for class imbalance learning.
Application to bioinformatics problems. In: International conference on machine
learning and applications. Miami Beach, Florida. Florida: IEEE Computer Society,
545–550 DOI 10.1109/ICMLA.2009.126.
Batuwita R, Palade V. 2012. Class imbalance learning methods for support vector
machines. In: He H, Ma Y, eds. Imbalanced learning: foundations, algorithms, and
applications. Hoboken: John Wiley & Sons, Inc, 83–99.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN,
Bourne PE. 2000. The protein data bank. Nucleic Acids Research 28(1):235–242
DOI 10.1093/nar/28.1.235.
Blaszczynski J, Stefanowski J. 2014. Neighbourhood sampling in bagging for imbalanced
data. Neurocomputing 150:529–542 DOI 10.1016/j.neucom.2014.07.064.
Campello RJGB, Moulavi D, Sander J. 2013. Density-based clustering based on
hierarchical density estimates. In: Pei J, Tseng VS, Cao L, Motoda H, Xu G, eds. Advances
in knowledge discovery and data mining PAKDD Part II LNAI. Berlin: Springer,
160–172.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. 2002. SMOTE: synthetic minority
over-sampling technique. Journal of Artificial Intelligence Research 16:321–357
DOI 10.1613/jair.953.
Chawla NV, Cieslak DA, Hall LO, Joshi A. 2008. Automatically countering imbalance
and its empirical relationship to cost. Data Mining and Knowledge Discovery
17(2):225–252 DOI 10.1007/s10618-008-0087-0.