Performance measures in evaluating machine learning based bioinformatics predictors for classifications

Background: Many existing bioinformatics predictors are based on machine learning. Before such predictors are applied in practical studies, their predictive performance should be well understood. However, different studies apply different performance measures and different evaluation methods. Even for the same performance measure, the terms, nomenclature and notation may differ from one context to another.

Results: We reviewed the most commonly used performance measures and evaluation methods for bioinformatics predictors.

Conclusions: In bioinformatics, it is important to understand and interpret predictive performance correctly, as this is the key to rigorously comparing different predictors and to choosing the right one.
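The abstract does not name the measures it reviews. As an illustration only, the sketch below assumes a binary predictor and computes several measures that are commonly derived from a confusion matrix (sensitivity, specificity, precision, accuracy, balanced accuracy and the Matthews correlation coefficient). The function name `binary_measures` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_measures(tp, fp, tn, fn):
    """Compute common binary-classification measures from confusion-matrix counts.

    tp, fp, tn, fn: numbers of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    sensitivity = _ratio(tp, tp + fn)   # also called recall or true positive rate
    specificity = _ratio(tn, tn + fp)   # true negative rate
    precision = _ratio(tp, tp + fp)     # positive predictive value
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2

    # Matthews correlation coefficient; set to 0.0 by convention
    # when any marginal total is zero (an assumption of this sketch).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # A roughly balanced example, then an imbalanced one in which the
    # predictor labels every sample as negative.
    for counts in [(40, 10, 45, 5), (0, 0, 95, 5)]:
        print(counts, binary_measures(*counts))
```

In the second example the predictor never reports a positive, yet its accuracy is still high because negatives dominate; balanced accuracy and the Matthews correlation coefficient expose the problem. This is one reason the same predictor can look very different depending on which measure a study reports.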
