Detection of long repeat expansions from PCR-free whole-genome sequence data

Identifying large repeat expansions such as those that cause amyotrophic lateral sclerosis (ALS) and fragile X syndrome is challenging with short-read (100-150 bp) whole-genome sequencing (WGS) data. Solving this problem is an important step towards integrating WGS into precision medicine. We have developed a software tool, ExpansionHunter, that uses PCR-free short-read WGS data to genotype repeats at a locus of interest, even when the expanded repeat is longer than the read length. We applied our algorithm to WGS data from 3,001 ALS patients who had been tested for the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Taking the RP-PCR calls as the ground truth, our WGS-based method identified pathogenic repeat expansions with 98.1% sensitivity and 99.7% specificity. Further inspection showed that all 11 discordant calls were errors in the original RP-PCR results. Compared against this updated truth set, ExpansionHunter correctly classified all (212/212) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2,786/2,789) of the wild-type samples were correctly classified as wild type, with the remaining three identified as possible expansions. We further applied our algorithm to a set of 144 samples, each carrying one of eight different pathogenic repeat expansions, including those associated with fragile X syndrome, Friedreich’s ataxia and Huntington’s disease, and correctly flagged all of the known repeat expansions. Finally, we tested the accuracy of our method for short repeats by comparing our genotypes with results from 860 samples sized using fragment length analysis and determined that our calls were >95% accurate. ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and gives researchers a tool for identifying new pathogenic repeat expansions.
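As a quick arithmetic check on the C9orf72 figures quoted above, the counts can be recomputed in a few lines. This is only a sketch for the reader: the constant and function names are hypothetical and are not part of ExpansionHunter, and the only inputs are the counts stated in the abstract for the updated truth set.

```python
# Counts quoted in the abstract (truth set after the RP-PCR conflicts were resolved).
# All names below are illustrative; none come from ExpansionHunter itself.
COHORT_SIZE = 3001
EXPANDED_TOTAL = 212
CALLED_EXPANSION = 208
CALLED_POTENTIAL = 4
WILD_TYPE_TOTAL = 2789
WILD_TYPE_CORRECT = 2786


def percent(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


def summarize():
    """Recompute the headline proportions from the quoted counts."""
    flagged_expanded = CALLED_EXPANSION + CALLED_POTENTIAL
    return {
        # Expanded plus wild-type samples should account for the whole cohort.
        "cohort_matches": EXPANDED_TOTAL + WILD_TYPE_TOTAL == COHORT_SIZE,
        # Share of expanded samples flagged as expansion or potential expansion.
        "expanded_flagged_pct": percent(flagged_expanded, EXPANDED_TOTAL),
        # Share of wild-type samples called wild type.
        "wild_type_correct_pct": percent(WILD_TYPE_CORRECT, WILD_TYPE_TOTAL),
        # Wild-type samples not called wild type (flagged as possible expansions).
        "wild_type_flagged": WILD_TYPE_TOTAL - WILD_TYPE_CORRECT,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

The 98.1% sensitivity and 99.7% specificity figures are measured against the original RP-PCR calls, whose full confusion matrix is not given in the abstract, so they are deliberately not recomputed here.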

Chris E Shaw | Giuseppe Narzisi | Aleksey Shatunov | Egor Dolzhenko | Michael A Eberle | Ammar Al-Chalabi | Joke J F A van Vugt | Bryan R Lajoie | Ryan J Taft | David R Bentley | Orla Hardiman | Vani Rajan | Raymond D Schellevis | William Sproviero | Christopher W Ng | Leonard H van den Berg | Subramanian S Ajay | Sean J Humphray | Zoya Kingsbury | Richard J Shaw | David E Housman | Ashley Jones | Russell McLaughlin | Bradley N Smith | Michael A van Es | Jan H Veldink | Mitchell A Bekritsky | Catherine Reeves | Nancy S Wexler | Karen Morrison | Pamela J Shaw | Matt Baker | Alan Pittman | Rosa Rademakers | Gijs H P Tazelaar | William J Brands | Maarten Kooyman | Marka van Blitterswijk | Ahmad Al Khleifat | Nathan H Johnson | Sarah Morgan | Edmund J Neo | Lara Winterkorn | Alina L Li
