Matching phenotypes to whole genomes: Lessons learned from four iterations of the Personal Genome Project community challenges

The advent of next-generation sequencing has dramatically decreased the cost of whole-genome sequencing and increased the viability of its application in research and clinical care. The Personal Genome Project (PGP) provides unrestricted access to the genomes of individuals and their associated phenotypes. This resource enabled the Critical Assessment of Genome Interpretation (CAGI) to create a community challenge assessing the bioinformatics community's ability to predict traits from whole genomes. In the CAGI PGP challenge, researchers were asked to predict whether an individual had a particular trait or profile based on their whole genome. Submissions were assessed using several approaches, including ROC AUC (area under the receiver operating characteristic curve), probability rankings, the number of correct predictions, and simulations of statistical significance. Overall, we found that predicting individual traits is difficult and relies on strong knowledge of trait frequency in the general population, whereas matching genomes to trait profiles relies heavily on a small number of common traits, including ancestry, blood type, and eye color. When a rare genetic disorder is present, a profile can be matched if one or more pathogenic variants are identified. Prediction accuracy has improved substantially over the last 6 years due to improved methodology and a better understanding of features.
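As an illustration of two of the assessment measures named above, the following minimal Python sketch computes ROC AUC for a single binary trait and estimates its significance by shuffling the trait labels. It is not the assessors' actual pipeline: the function names `roc_auc` and `permutation_p_value`, the example labels and scores, and the choice of a label-permutation test as the "significance simulation" are all assumptions made for illustration.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores, n_perm=2000, seed=0):
    """Hypothetical significance simulation: the fraction of label
    shufflings whose AUC is at least as high as the observed AUC."""
    rng = random.Random(seed)
    observed = roc_auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if roc_auc(shuffled, scores) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up data: 1 = individual reports the trait, 0 = does not.
example_labels = [1, 0, 1, 0, 0, 1, 0, 0]
# Made-up predicted probabilities that each individual has the trait.
example_scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.35, 0.3, 0.6]

if __name__ == "__main__":
    print("AUC:", roc_auc(example_labels, example_scores))
    print("p-value:", permutation_p_value(example_labels, example_scores))
```

The pairwise formulation of AUC is quadratic in the number of individuals, which is acceptable for a cohort the size of a community challenge; a rank-based computation would be preferable for larger data.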
