A Bayesian method for comparing and combining binary classifiers in the absence of a gold standard

BackgroundMany problems in bioinformatics involve classification based on features such as sequence, structure or morphology. Given multiple classifiers, two crucial questions arise: how does their performance compare, and how can they best be combined to produce a better classifier? A classifier can be evaluated in terms of sensitivity and specificity using benchmark, or gold standard, data, that is, data for which the true classification is known. However, a gold standard is not always available. Here we demonstrate that a Bayesian model for comparing medical diagnostics without a gold standard can be successfully applied in the bioinformatics domain, to genomic scale data sets. We present a new implementation, which unlike previous implementations is applicable to any number of classifiers. We apply this model, for the first time, to the problem of finding the globally optimal logical combination of classifiers.ResultsWe compared three classifiers of protein subcellular localisation, and evaluated our estimates of sensitivity and specificity against estimates obtained using a gold standard. The method overestimated sensitivity and specificity with only a small discrepancy, and correctly ranked the classifiers. Diagnostic tests for swine flu were then compared on a small data set. Lastly, classifiers for a genome-wide association study of macular degeneration with 541094 SNPs were analysed. In all cases, run times were feasible, and results precise. The optimal logical combination of classifiers was also determined for all three data sets. Code and data are available from http://bioinformatics.monash.edu.au/downloads/.ConclusionsThe examples demonstrate the methods are suitable for both small and large data sets, applicable to the wide range of bioinformatics classification problems, and robust to dependence between classifiers. In all three test cases, the globally optimal logical combination of the classifiers was found to be their union, according to three out of four ranking criteria. We propose as a general rule of thumb that the union of classifiers will be close to optimal.

[1]  S. Walter,et al.  Estimating the error rates of diagnostic tests. , 1980, Biometrics.

[2]  P. Zhao,et al.  Combining Machine Learning and Homology-Based Approaches to Accurately Predict Subcellular Localization in Arabidopsis1[C][W][OA] , 2010, Plant Physiology.

[3]  Subhash C. Bagui,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2005, Technometrics.

[4]  J LunnDavid,et al.  WinBUGS A Bayesian modelling framework , 2000 .

[5]  Andrew Gelman,et al.  R2WinBUGS: A Package for Running WinBUGS from R , 2005 .

[6]  D. Spiegelhalter,et al.  Modelling Complexity: Applications of Gibbs Sampling in Medicine , 1993 .

[7]  Jorge Alberto Achcar,et al.  Bayesian Estimation of Performance Measures of Cervical Cancer Screening Tests in the Presence of Covariates and Absence of a Gold Standard , 2008, Cancer informatics.

[8]  Andrew Thomas,et al.  WinBUGS - A Bayesian modelling framework: Concepts, structure, and extensibility , 2000, Stat. Comput..

[9]  L. Joseph,et al.  Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. , 1995, American journal of epidemiology.

[10]  S. Hui,et al.  Evaluation of diagnostic tests without gold standards , 1998, Statistical methods in medical research.

[11]  Simon C. Potter,et al.  Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls , 2007, Nature.

[12]  Manuel A. R. Ferreira,et al.  PLINK: a tool set for whole-genome association and population-based linkage analyses. , 2007, American journal of human genetics.

[13]  P. Visscher,et al.  A versatile gene-based test for genome-wide association studies. , 2010, American journal of human genetics.

[14]  Ying Wang,et al.  Genomewide association study of leprosy. , 2009, The New England journal of medicine.

[15]  Jie Jin Wang,et al.  Sifting the wheat from the chaff: prioritizing GWAS results by identifying consistency across analytical methods , 2011, Genetic epidemiology.

[16]  Ludmila I. Kuncheva,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2004 .