Accurate typing of short tandem repeats from genome-wide sequencing data and its applications

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM (short tandem repeat profiling using flank-based mapping), a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and organelle genomes). We used STR-FM to study STR error rates and patterns in publicly available human sequencing data sets and in ultradeep plasmid sequencing data sets generated in-house. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Using this pipeline, we determined, for the first time, the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends the minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping and is thus of wide interest to researchers investigating disease-related STRs and STR evolution.
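As a rough illustration of the idea behind flank-based profiling, the toy Python sketch below locates the longest perfect tandem repeat in a single read and splits the read into a left flank, the repeat, and a right flank. It is not the STR-FM implementation: the function name `find_str`, the `StrHit` record, the regular-expression approach, and the default thresholds are all assumptions made for this example, and a real pipeline would go on to map the flanks to a reference genome.

```python
import re
from typing import NamedTuple, Optional


class StrHit(NamedTuple):
    """Hypothetical record for one repeat found in a read."""
    left_flank: str
    motif: str
    repeat_count: int
    right_flank: str


def find_str(read: str, motif_len: int = 2, min_repeats: int = 3) -> Optional[StrHit]:
    """Return the longest perfect tandem repeat of motif_len-bp units, or None.

    Illustrative only: interrupted repeats and sequencing errors are ignored.
    """
    # One motif copy followed by at least (min_repeats - 1) further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    best = None
    for match in pattern.finditer(read):
        if best is None or len(match.group(0)) > len(best.group(0)):
            best = match
    if best is None:
        return None
    return StrHit(
        left_flank=read[:best.start()],
        motif=best.group(1),
        repeat_count=len(best.group(0)) // motif_len,
        right_flank=read[best.end():],
    )


if __name__ == "__main__":
    # A made-up read: 4-bp flank, (CA)x5, 4-bp flank.
    print(find_str("GGAT" + "CA" * 5 + "TTGC"))
```

In this sketch the two flanks are what would anchor the read to the reference, so the repeat itself never has to be aligned; that separation is the motivation for treating the flanks and the repeat as distinct pieces.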
