Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing

Motivation: Long expansions of short tandem repeats (STRs), i.e. DNA repeats of 2–6 nt, are associated with some genetic diseases. Cost-efficient high-throughput sequencing can quickly produce billions of short reads that would be useful for uncovering disease-associated STRs. However, enumerating STRs in short reads remains largely unexplored because of the difficulty in elucidating STRs much longer than 100 bp, the typical length of short reads.

Results: We propose ab initio procedures for sensing and locating long STRs promptly by using the frequency distribution of all STRs and paired-end read information. We validated the reproducibility of this method using biological replicates and used it to locate an STR associated with a brain disease (SCA31). Subsequently, we sequenced this STR site in 11 SCA31 samples using SMRT™ sequencing (Pacific Biosciences), determined 2.3–3.1 kb sequences at nucleotide resolution and revealed that (TGGAA)- and (TAAAATAGAA)-repeat expansions determined the instability of the repeat expansions associated with SCA31. Our method could also identify common STRs, (AAAG)- and (AAAAG)-repeat expansions, which are remarkably expanded at four positions in an SCA31 sample. This is the first proposed method for rapidly finding disease-associated long STRs in personal genomes using hybrid sequencing of short and long reads.

Availability and implementation: Our TRhist software is available at http://trhist.gi.k.u-tokyo.ac.jp/.

Contact: moris@cb.k.u-tokyo.ac.jp

Supplementary information: Supplementary data are available at Bioinformatics online.
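To make the idea of a frequency distribution of STRs concrete, the following is a minimal Python sketch that tallies exact 2–6 nt repeat runs in short reads by repeat unit and run length. It is an illustration under stated assumptions, not the TRhist implementation: the function names, the regular expression, the three-copy minimum and the rotation-based canonical unit are all hypothetical choices, and the sketch ignores sequencing errors, reverse complements and paired-end information.

```python
import re
from collections import Counter

# Assumption: an STR run is a 2-6 nt unit repeated at least three times in a row.
# The lazy quantifier makes the regex try the shortest unit first.
_STR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{2,}")


def canonical_unit(unit):
    """Return the lexicographically smallest rotation of a repeat unit,
    so that TGGAA, GGAAT, ... are counted as the same repeat."""
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def str_runs(read):
    """Yield (canonical unit, run length in nt) for each exact STR run.
    Only whole copies of the unit are counted in the run length."""
    for match in _STR_PATTERN.finditer(read):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymers such as AAAA..., which are 1 nt repeats.
            continue
        yield canonical_unit(unit), len(match.group(0))


def str_histogram(reads):
    """Count how many runs of each (unit, length) occur across all reads."""
    counts = Counter()
    for read in reads:
        for unit, length in str_runs(read):
            counts[(unit, length)] += 1
    return counts


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    reads = ["CCC" + "TGGAA" * 4 + "CT", "ACACACAC", "AAAAAAAAAA"]
    for (unit, length), n in sorted(str_histogram(reads).items()):
        print(unit, length, n)
```

In a histogram of this kind, a repeat unit with many runs that span nearly the full read length would be a candidate for an expansion longer than the reads themselves; locating it in the genome would still require additional evidence such as the paired-end information mentioned above.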
