A phenotype centric benchmark of variant prioritisation tools

Next generation sequencing is a standard tool used in clinical diagnostics. In Mendelian diseases the challenge is to discover the single etiological variant among thousands of benign or functionally unrelated variants. After calling variants from aligned sequencing reads, variant prioritisation tools are used to examine the conservation or potential functional consequences of variants. We hypothesised that the performance of variant prioritisation tools may vary by disease phenotype. To test this, we created benchmark data sets for variants associated with different disease phenotypes. We found that performance of 24 tested tools is highly variable and differs by disease phenotype. The task of identifying a causative variant amongst a large number of benign variants is challenging for all tools, highlighting the need for further development in the field. Based on our observations, we recommend use of five top performers found in this study (FATHMM, M-CAP, MetaLR, MetaSVM and VEST3). In addition, we provide tables indicating which analytical approach works best in which disease context. Variant prioritisation tools are best suited to investigate variants associated with well-studied genetic diseases, as these variants are more readily available during algorithm development than variants associated with rare diseases. We anticipate that further development into disease focussed tools will lead to significant improvements.

Genomic analysis: Tools for prioritizing gene variants impacted by disease context

The performance of software tools used to distinguish disease-causing genetic variants depends on the type of disease under investigation. Denise Anderson and Timo Lassmann from the Telethon Kids Institute in Subiaco, Australia, compared 24 software tools commonly used to narrow down DNA variants found in next-generation sequencing data to those likely to cause a particular disease. By looking at more than 4000 disease phenotypes, the researchers found that the different prioritisation tools, owing to their different methodologies and algorithms, varied in their ability to discriminate between pathogenic and benign gene variants. The top-performing tools all used machine-learning techniques and worked best in cases of well-studied genetic diseases. The findings highlight the need for additional disease-focused tool development, and offer a resource to help researchers decide which approach to use in different disease contexts.
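The idea of scoring a tool separately for each disease phenotype can be illustrated with a minimal sketch. It is not the authors' pipeline: the text above does not state which performance metric was used, so the sketch assumes the area under the ROC curve, computed directly as the probability that a pathogenic variant outscores a benign one. The phenotype names, scores, and labels are invented toy values, and the function names are hypothetical.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen pathogenic variant
    (label 1) receives a higher score than a randomly chosen benign
    variant (label 0); tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_phenotype(records):
    """Group (phenotype, score, label) records and score each group separately."""
    grouped = defaultdict(lambda: ([], []))
    for phenotype, score, label in records:
        grouped[phenotype][0].append(score)
        grouped[phenotype][1].append(label)
    return {p: auroc(s, y) for p, (s, y) in grouped.items()}


# Toy data (assumption): one tool's scores for variants in two phenotypes.
records = [
    ("phenotype_A", 0.9, 1), ("phenotype_A", 0.8, 1),
    ("phenotype_A", 0.2, 0), ("phenotype_A", 0.1, 0),
    ("phenotype_B", 0.6, 1), ("phenotype_B", 0.3, 1),
    ("phenotype_B", 0.5, 0), ("phenotype_B", 0.4, 0),
]

result = auroc_by_phenotype(records)
for phenotype, value in sorted(result.items()):
    print(f"{phenotype}: AUROC = {value:.2f}")
```

In this toy case the same tool separates pathogenic from benign variants perfectly for one phenotype and no better than chance for the other, which is the kind of phenotype-dependent difference a pooled score would hide.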
