Who’s who? Detecting and resolving sample anomalies in human DNA sequencing studies with peddy

The potential for genetic discovery in human DNA sequencing studies is greatly diminished if DNA samples from the cohort are mislabelled, swapped, contaminated, or include unintended individuals. Unfortunately, the potential for such errors is significant since DNA samples are often manipulated by several protocols, labs or scientists in the process of sequencing. We have developed peddy to identify and facilitate the remediation of such errors via interactive visualizations and reports comparing the stated sex, relatedness, and ancestry to what is inferred from each individual’s genotypes. Peddy predicts a sample’s ancestry using a machine learning model trained on individuals of diverse ancestries from the 1000 Genomes Project reference panel. Peddy’s speed, text reports and web interface facilitate both automated and visual detection of sample swaps, poor sequencing quality and other indicators of sample problems that, were they left undetected, would inhibit discovery.

Software Availability: https://github.com/brentp/peddy

Demonstration (Chrome suggested): http://home.chpc.utah.edu/~u6000771//plots/ceph1463.html
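
The comparison of stated and inferred sex described in the abstract can be illustrated with a minimal sketch. This is not peddy's implementation or API: the `Sample` class, the function names, the genotype encoding (alternate-allele counts at X-chromosome sites outside the pseudoautosomal regions) and the 0.1 heterozygosity cutoff are all assumptions chosen for illustration. The idea is that a sample with a single X chromosome should show almost no heterozygous calls at such sites, so a high heterozygous fraction in a sample recorded as male (or a near-zero fraction in one recorded as female) suggests a mislabelled or swapped sample.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical cutoff, chosen for illustration only; a real analysis
# would calibrate it on the cohort's own data.
HET_FRACTION_CUTOFF = 0.1


@dataclass
class Sample:
    sample_id: str
    stated_sex: str         # "male" or "female", as recorded in the pedigree
    x_genotypes: List[int]  # alt-allele counts (0, 1, 2) at non-PAR chrX sites


def infer_sex(x_genotypes: List[int], cutoff: float = HET_FRACTION_CUTOFF) -> str:
    """Guess sex from the fraction of heterozygous calls on chrX."""
    called = [g for g in x_genotypes if g in (0, 1, 2)]  # drop missing calls
    if not called:
        return "unknown"
    het_fraction = sum(1 for g in called if g == 1) / len(called)
    return "female" if het_fraction > cutoff else "male"


def find_sex_mismatches(samples: List[Sample]) -> List[Tuple[str, str, str]]:
    """Return (sample_id, stated_sex, inferred_sex) for discordant samples."""
    flagged = []
    for s in samples:
        inferred = infer_sex(s.x_genotypes)
        if inferred != "unknown" and inferred != s.stated_sex:
            flagged.append((s.sample_id, s.stated_sex, inferred))
    return flagged


# Toy cohort with made-up genotypes.
samples = [
    Sample("A", "male", [0, 2, 0, 2, 0, 0, 2, 0, 0, 0]),
    Sample("B", "male", [0, 1, 1, 2, 1, 0, 1, 0, 1, 2]),
    Sample("C", "female", [1, 1, 0, 2, 1, 0]),
]
for sample_id, stated, inferred in find_sex_mismatches(samples):
    print(f"{sample_id}: stated {stated}, genotypes suggest {inferred}")
```

In this toy cohort only sample B is flagged, since half of its X-chromosome calls are heterozygous despite a stated sex of male. A two-way call like this cannot represent sex-chromosome aneuploidies, so a flagged sample is a prompt for manual review rather than proof of a swap.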
