Problems with the nested granularity of feature domains in bioinformatics: the eXtasy case

BackgroundData from biomedical domains often have an inherit hierarchical structure. As this structure is usually implicit, its existence can be overlooked by practitioners interested in constructing and evaluating predictive models from such data. Ignoring these constructs leads to potentially problematic and the routinely unrecognized bias in the models and results. In this work, we discuss this bias in detail and propose a simple, sampling-based solution for it. Next, we explore its sources and extent on synthetic data. Finally, we demonstrate how the state-of-the-art variant prioritization framework, eXtasy, benefits from using the described approach in its Random forest-based core classification model.Results and conclusionsThe conducted simulations clearly indicate that the heterogeneous granularity of feature domains poses significant problems for both the standard Random forest classifier and a modification that relies on stratified bootstrapping. Conversely, using the proposed sampling scheme when training the classifier mitigates the described bias. Furthermore, when applied to the eXtasy data under a realistic class distribution scenario, a Random forest learned using the proposed sampling scheme displays much better precision that its standard version, without degrading recall. Moreover, the largest performance gains are achieved in the most important part of the operating range: the top of prioritized gene list.

[1]  David Hinkley,et al.  Bootstrap Methods: Another Look at the Jackknife , 2008 .

[2]  D. Haussler,et al.  Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. , 2005, Genome research.

[3]  A Ziegler,et al.  EDITOR Comments on ‘Practical experiences on the necessity of external validation’ , 2008 .

[4]  P. Bork,et al.  A method and server for predicting damaging missense mutations , 2010, Nature Methods.

[5]  Insuk Lee,et al.  Characterising and Predicting Haploinsufficiency in the Human Genome , 2010, PLoS genetics.

[6]  P. Stankiewicz,et al.  Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. , 2010, The New England journal of medicine.

[7]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[8]  N. Stanietsky,et al.  The interaction of TIGIT with PVR and PVRL2 inhibits human NK cell cytotoxicity , 2009, Proceedings of the National Academy of Sciences.

[9]  Justin C. Fay,et al.  Identification of deleterious mutations within three human genomes. , 2009, Genome research.

[10]  Steven Henikoff,et al.  SIFT: predicting amino acid changes that affect protein function , 2003, Nucleic Acids Res..

[11]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..

[12]  BMC Bioinformatics , 2005 .

[13]  Fiona Cunningham,et al.  A Combined Functional Annotation Score for Non-Synonymous Variants , 2012, Human Heredity.

[14]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[15]  Bassem A. Hassan,et al.  Gene prioritization through genomic data fusion , 2006, Nature Biotechnology.

[16]  R. Polikar,et al.  Ensemble based systems in decision making , 2006, IEEE Circuits and Systems Magazine.

[17]  J. Shaffer Multiple Hypothesis Testing , 1995 .

[18]  Vanessa E. Gray,et al.  Evolutionary diagnosis method for variants in personal exomes , 2012, Nature Methods.

[19]  Bart De Moor,et al.  eXtasy: variant prioritization by genomic data fusion , 2013, Nature Methods.

[20]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[21]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[22]  M. Vihinen How to evaluate performance of prediction methods? Measures and their interpretation in variation effect analysis , 2012, BMC Genomics.

[23]  Thomas G. Dietterich Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms , 1998, Neural Computation.

[24]  J.,et al.  The New England Journal of Medicine , 2012 .

[25]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[26]  Jana Marie Schwarz,et al.  MutationTaster evaluates disease-causing potential of sequence alterations , 2010, Nature Methods.

[27]  Roumiana Tsenkova,et al.  Near Infrared Spectroscopy Using Short Wavelengths and Leave-One-Cow-Out Cross-Validation for Quantification of Somatic Cells in Milk , 2009 .

[28]  P. Stenson,et al.  The Human Gene Mutation Database: 2008 update , 2009, Genome Medicine.

[29]  Steven Salzberg,et al.  Detection of lineage-specific evolutionary changes among primate species , 2011, BMC Bioinformatics.

[30]  E. Boerwinkle,et al.  dbNSFP: A Lightweight Database of Human Nonsynonymous SNPs and Their Functional Predictions , 2011, Human mutation.

[31]  D. Stone,et al.  Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro , 2003, Proceedings of the National Academy of Sciences of the United States of America.