Design and Analysis of Classifier Learning Experiments in Bioinformatics: Survey and Case Studies

In many bioinformatics applications, it is important to assess and compare the performances of algorithms trained from data, so that conclusions can be drawn that are unaffected by chance and are therefore significant. Both the design of such experiments and the analysis of the resulting data using statistical tests should be done carefully for the results to carry significance. In this paper, we first review the performance measures used in classification, the basics of experiment design, and statistical tests. We then give the results of our survey of more than 1,500 papers published in the last two years in three bioinformatics journals (including this one). Although the basics of experiment design, such as resampling instead of using a single training set and reporting performance measures other than error, are well understood, only 21 percent of the papers use any statistical test for comparison. In the third part, we analyze four different scenarios that we encounter frequently in the bioinformatics literature, discussing the proper statistical methodology and showing an example case study for each. With the supplementary software, we hope that the guidelines we discuss will play an important role in future studies.
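As a minimal illustration of one scenario of this kind (not the paper's supplementary software), the sketch below compares two classifiers on a single data set using 10-fold cross-validation and a paired t-test on the per-fold errors. The choice of data set (scikit-learn's breast cancer data), classifiers (a linear SVM vs. a decision tree), and fold count are illustrative assumptions, not taken from the paper.

```python
# Sketch: compare two classifiers with 10-fold CV and a paired t-test.
# All concrete choices below (data set, models, 10 folds) are illustrative.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = SVC(kernel="linear"), DecisionTreeClassifier(random_state=0)

errors_a, errors_b = [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    for clf, errs in ((clf_a, errors_a), (clf_b, errors_b)):
        clf.fit(X[train_idx], y[train_idx])
        errs.append(1.0 - clf.score(X[test_idx], y[test_idx]))

# Paired t-test over the folds. Overlapping training sets violate the
# independence assumption, which is why corrected procedures such as the
# 5x2 cv F test are often recommended instead.
t_stat, p_value = stats.ttest_rel(errors_a, errors_b)
print(f"mean error A = {np.mean(errors_a):.3f}, "
      f"mean error B = {np.mean(errors_b):.3f}, p = {p_value:.3f}")
```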
