ANALYSIS AND PREDICTION OF MAJOR BLOOD PROTEINS BASED ON THEIR AMINO ACID
AND DIPEPTIDE COMPOSITION
Abstract- A method has been developed for predicting blood proteins using an SVM-based machine learning approach. A two-step
strategy was used to predict blood proteins and then their subclasses. The blood-protein models achieved maximum accuracies of
90.57% and 91.39%, with Matthews correlation coefficients (MCC) of 0.89 and 0.90, using single amino acid and dipeptide composition
respectively. Furthermore, the method predicts the major subclasses of blood proteins: models based on amino acid composition (AC)
reached maximum accuracies of 90.38%, 92.83%, 87.41% and 92.52%, and models based on dipeptide composition (DC) reached 85.27%,
89.07%, 94.82% and 86.31%, for albumin, globulin, fibrinogen and regulatory proteins respectively. All modules were trained, tested
and evaluated using the five-fold cross-validation technique.
Keywords- Major Blood Proteins, Amino Acid Composition, Dipeptide Composition, SVM, five-fold cross-validation technique
Citation: Muthukrishnan S., Puri M. and Lefevre C. (2013) Analysis and Prediction of Major Blood Proteins Based on Their Amino Acid and
Dipeptide Composition. International Journal of Bioinformatics Research, ISSN: 0975-3087 & E-ISSN: 0975-9115, Volume 5, Issue 1,
pp.-285-288.
Copyright: Copyright©2013 Muthukrishnan S., et al. This is an open-access article distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and
source are credited.
that the method can differentiate blood proteins from non-blood proteins with an accuracy of 90.57% and an MCC of 0.89 at the default
cutoff score of 0. The best result was obtained using an RBF kernel with optimized parameters γ = 3 and C = 375. The dipeptide
composition method was also tested and achieved 91.39% accuracy with an MCC of 0.90. On average, the amino acid and dipeptide
compositions of blood proteins were significantly different from those of non-blood proteins [Table-1].
We applied the same method to classify the major blood proteins. Here we took one class of blood proteins as positive examples and
all other classes as negative examples, and repeated this for every class. We prepared models for each blood-protein class based on
amino acid as well as dipeptide composition, each with its own optimized SVM kernel parameters. This indicates that each class of
blood proteins can be discriminated from the other classes based on amino acid and dipeptide composition [13-16]. With amino acid
composition we achieved maximum accuracies, as shown in [Table-2], of 90.38%, 92.83%, 87.41% and 92.52% with MCC of 0.88, 0.85, 0.63
and 0.43 for albumin, globulin, fibrinogen and regulatory proteins respectively. Regulatory and globulin proteins show the highest
accuracy (>92%) among the classes of blood proteins.
Since simple amino acid composition provides information only about the frequency of residues and not about their local order,
additional SVM modules based on dipeptide composition were developed. Whereas the amino acid composition SVM was given an input
vector of dimension 20, the dipeptide module used a vector of 400 dipeptide compositions and was optimized with γ = 10 and C = 400.
The results are shown in [Table-2]: we achieved maximum accuracies of 85.27%, 89.07%, 94.82% and 86.31% with MCC of 0.87, 0.88, 0.87
and 0.70 for albumin, globulin, fibrinogen and regulatory proteins respectively. Fibrinogen shows the highest accuracy (94.82%) among
the subclasses. The dipeptide prediction accuracy was improved significantly over single amino acid prediction. Therefore, prediction
accuracy can be increased by using a wider range of information about a protein. Sensitivity and specificity were also calculated
for blood proteins and their subclasses, as shown in [Table-1] and [Table-2].
Analysis of Amino Acids
Determining the relative amino acid composition of a protein gives a characteristic profile for that protein. This profile provides
enough information to identify the major blood proteins. Here, we used the number of occurrences of each amino acid divided by the
total number of amino acids in the protein. The average amino acid composition of blood proteins was calculated and is shown,
together with that of non-blood proteins, in [Fig-1] and [Fig-2]. The analysis shows that Cys, Pro, Ser and Tyr are more abundant in
blood proteins than in non-blood proteins. Among the subclasses of blood proteins, Leu residues are high in all classes, while Cys,
Asp, Gly and Asn are higher in fibrinogen proteins than in the other classes. Overall, regulatory and globulin proteins have similar
percentages of all residues.
Fig. 1- Average amino acid composition chart of blood vs. non-blood proteins
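The 20-dimensional amino acid vector and the 400-dimensional dipeptide vector discussed above can be sketched in a few lines of
Python. This is a minimal illustration, not the authors' implementation: the function names (`amino_acid_composition`,
`dipeptide_composition`, `average_composition`), the choice to express fractions as percentages, and the handling of non-standard
residue codes (they are skipped) are all assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def amino_acid_composition(seq):
    """Return a 20-element vector: percentage of each residue in seq."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]  # assumption: skip other codes
    total = len(residues)
    if total == 0:
        raise ValueError("sequence has no standard residues")
    return [100.0 * residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-element vector: percentage of each adjacent residue pair."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard codes are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence has no standard dipeptides")
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


def average_composition(sequences):
    """Average the amino acid composition over a group of sequences."""
    vectors = [amino_acid_composition(s) for s in sequences]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    toy = "MKWVTFISLLFLFSSAYS"  # hypothetical sequence, for illustration only
    print(len(amino_acid_composition(toy)), len(dipeptide_composition(toy)))
```

Averaging such vectors separately over the blood and non-blood sets would give the kind of per-residue comparison that a chart
like Fig. 1 displays.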
Table 1- Performance of various SVM modules of blood-protein predictions developed using various types of compositions: amino
acids (AC) and dipeptides (DC).
Methods   ACC(%)   SN(%)   SP(%)   MCC    Parameter γ   Parameter C
AC        90.57    97.16   84.69   0.89   3             375
DC        91.4     96.77   86.6    0.9    10            400
AC- Amino acid composition, DC- dipeptide composition, ACC- accuracy, SN- sensitivity, SP- specificity, MCC- Matthews correlation
coefficient, C- trade-off value, γ- gamma factor (a parameter of the RBF kernel)
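The four measures reported in Table 1 are conventionally derived from confusion-matrix counts. The sketch below uses the standard
textbook definitions; the function name `classification_metrics` and the example counts are hypothetical and are not taken from
the study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return ACC, SN and SP as percentages and MCC in the range [-1, 1]."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    acc = 100.0 * (tp + tn) / total
    # Sensitivity: share of true positives among all actual positives.
    sn = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of true negatives among all actual negatives.
    sp = 100.0 * tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(classification_metrics(tp=90, tn=80, fp=20, fn=10))
```

Because MCC uses all four counts, it is less affected than accuracy by unequal numbers of positive and negative examples.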
Methods
Dataset
The final dataset of blood proteins, including subfamilies, consists of 717 sequences (albumin 91, globulin 33, fibrinogen 564 and
regulatory 29). As the negative set, 899 sequences belonging to the protease family were selected randomly. The protein sequences
were obtained from the UniProt and ExPASy servers. Entries annotated as "fragments", "isoforms", "potentials", "similarity" or
"probables" in the comment field were removed to avoid bias in the classifier. We used a 90% cutoff to generate a non-redundant
dataset of both blood and non-blood sequences.
Support Vector Machine (SVM)
In the present study, SVM_light, a freely downloadable SVM package, was used to classify major blood-protein sequences. This software
enables users to define a number of parameters as well as to choose an inbuilt kernel, such as a radial basis function (RBF) or a
polynomial kernel (of given degree) [17]. In this study, all parameters of a kernel were kept constant, except for the regularization
parameter C. Experiments were conducted using various types of kernels, such as polynomial and radial basis function. SVMs require a
fixed number of inputs for training, thus necessitating a strategy for encapsulating the global information about proteins of
variable length in a fixed-length format. The fixed-length format was obtained from protein sequences of variable length using amino
acid and dipeptide composition. SVMs have been successfully applied to numerous classification and pattern recognition problems such
as classification of microarray data, protein secondary structure prediction and subcellular localization [3,4,18].
Amino Acid Composition
Amino acid composition is the fraction of each amino acid in a protein. The fraction of each of the 20 natural amino acids was
calculated using [Eq-1] [19,20].
\[ \text{Fraction of amino acid } i = \frac{\text{Total number of amino acid } i}{\text{Total number of amino acids in protein}} \tag{1} \]
where i can be any amino acid.
Dipeptide Composition
Dipeptide composition is used to encapsulate the global information about each protein sequence, which gives a fixed pattern length
of 400 (20 × 20). The fraction of each dipeptide was calculated using [Eq-2] [19].
\[ \text{Fraction of dep}(i+1) = \frac{\text{Total number of dep}(i+1)}{\text{Total number of all possible dipeptides}} \tag{2} \]
where dep(i+1) is one out of 400 dipeptides.
Evaluation of Performance
In this study, a 5-fold cross-validation technique was adopted, in which the dataset was partitioned randomly into five equal
subsets. Training and testing were carried out five times, each time using one subset for testing and the remaining four subsets for
training. The performance of each classifier is measured in terms of accuracy (ACC), sensitivity (SN), specificity (SP) and Matthews
correlation coefficient (MCC) by the standard [Eq-3] to [Eq-6] [20].
\[ \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{3} \]
\[ \mathrm{SN} = \frac{TP}{TP + FN} \times 100 \tag{4} \]
\[ \mathrm{SP} = \frac{TN}{TN + FP} \times 100 \tag{5} \]
\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6} \]
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives respectively. TP and
TN are the numbers of correctly classified blood proteins and non-blood proteins respectively. FP and FN are the numbers of proteins
incorrectly classified as blood and non-blood proteins respectively.
Prediction System
The prediction of blood and blood-related proteins is a multi-class classification problem. To handle this multi-class situation, we
designed a series of binary SVMs. For N-class classification, N SVMs were constructed. The ith SVM is trained with all samples of the
ith subfamily labeled as positive and the samples of all other subfamilies labeled as negative. SVMs trained in this way are
referred to as 1-v-r SVMs. In this classification approach, each unknown protein receives four scores and is classified into the
subfamily that corresponds to the 1-v-r SVM with the highest output score.
Acknowledgements
The authors are thankful to the Department of Biotechnology, Punjabi University, Patiala and Institute of Microbial Technology,
Chandigarh, India, for providing the necessary facilities to conduct the research work. The authors acknowledge Deakin University,
Australia for providing the web server for this study.
Conflict of Interest: None declared.
References
[1] Park K.J., Kanehisa M. (2003) Bioinformatics, 19, 1656-1663.
[2] Yu C.S., Chen Y.C., Lu C.H., Hwang J.K. (2006) Proteins, 64, 643-651.
[3] Chang D.T., Ou Y.Y., Hung H.G., Yang M.H., Chen C.Y., Oyang Y.J. (2008) BMC Res. Notes, 1(1), 51.
[4] Watson M., Dukes J., Abu-Median A.B., King D.P., Britton P., Detecti V. (2007) Genome Biol., 8(9), R190.
[5] Anderson N.L., Anderson N.G. (1977) Proceedings of the National Academy of Sciences, 74, 5421-5425.
[6] Adkins J.N., Varnum S.M., Auberry K.J., Moore R.J., Angell N.H., Smith R.D., Springer D.L., Pounds J.G. (2002) Molecular and Cellular Proteomics, 1, 947-955.
[7] Zunszain P.A., Ghuman J., Komatsu T., Tsuchida E., Curry S. (2003) BMC Structural Biology, 3(1), 6, doi: 10.1186/1472-6807-3-6.
[8] Roux K.H. (1999) Int. Arch Allergy Immunol., 120, 85-99.
[9] Laurens N., Koolwijk P., de Maat M.P. (2006) J. Thromb. Haemost., 4(5), 932-939.
[10] Bhasin M., Garg A., Raghava G.P.S. (2005) Bioinformatics, 21, 2522-2524.
[11] Kumar M., Gromiha M.M., Raghava G.P.S. (2007) BMC Bioinformatics, 8, 463.
[12] Garg A., Bhasin M., Raghava G.P.S. (2005) J. Biol. Chem., 280, 14427-14432.
[13] Yu C.S., Chen Y.C., Lu C.H., Hwang J.K. (2006) Proteins, 64, 643-651.
[14] Cai C.Z., Han L.Y., Ji Z.L., Chen X., Chen Y.Z. (2003) Nucleic Acids Research, 31, 3692-3697.
[15] Muthukrishnan S., Garg A., Raghava G.P.S. (2007) Genomics, Proteomics & Bioinformatics, 5, 250-252.
[16] Lata S., Sharma B.K., Raghava G.P.S. (2007) BMC Bioinformatics, 8, 263.
[17] Chou K.C., Shen H.B. (2007) Biochem Biophys Res. Commun., 360, 339-345.
[18] Nair R., Rost B. (2003) Proteins, 53(4), 917-930.
[19] Bhasin M., Raghava G.P.S. (2004) J. Biol. Chem., 279, 23262-23266.
[20] Kumar M., Verma R., Raghava G.P.S. (2006) J. Biol. Chem., 281, 5357-5363.
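As a supplementary illustration of the evaluation and prediction scheme described in the Methods, the sketch below shows a random
five-fold partition and the 1-v-r decision rule in plain Python. It does not call SVM_light and is not the authors' code: the
function names, the fixed random seed and the example scores are assumptions made for illustration only.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of a random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed only for reproducibility
    # Striding gives subsets whose sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class problem as +1 (positive_class) versus -1 (all others)."""
    return [1 if label == positive_class else -1 for label in labels]


def predict_subfamily(scores):
    """Return the subfamily whose 1-v-r classifier produced the highest score."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # 717 is the number of blood-protein sequences stated in the Dataset section.
    for train, test in five_fold_splits(717):
        print(len(train), len(test))
    # Hypothetical output scores from four 1-v-r classifiers for one protein.
    example = {"albumin": -0.4, "globulin": 0.1, "fibrinogen": 1.3, "regulatory": -0.9}
    print(predict_subfamily(example))
```

Each sample appears in exactly one test fold, so every sequence is predicted once by a model that was not trained on it.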