ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions

Abstract

Summary: We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci.

Availability and implementation: ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/.

Supplementary information: Supplementary data are available at Bioinformatics online.
