MIPSTR: a method for multiplex genotyping of germline and somatic STR variation across many individuals

Short tandem repeats (STRs) are highly mutable genetic elements that often reside in regulatory and coding DNA. The cumulative evidence from genetic studies of individual STRs suggests that STR variation profoundly affects phenotype and contributes to trait heritability. Despite recent advances in sequencing technology, STR variation has remained far less accessible across many individuals than single-nucleotide or copy-number variation. STR genotyping with short-read sequence data is confounded by (1) the difficulty of uniquely mapping short, low-complexity reads and (2) the high rate of STR amplification stutter. Here, we present MIPSTR, a robust, scalable, and affordable method that addresses these challenges. MIPSTR uses targeted capture of STR loci by single-molecule Molecular Inversion Probes (smMIPs) and a unique mapping strategy. Targeted capture and our mapping strategy resolve the first challenge; the use of single-molecule information resolves the second. Unlike previous methods, MIPSTR can distinguish technical error due to amplification stutter from somatic STR mutations. In proof-of-principle experiments, we use MIPSTR to determine germline STR genotypes for 102 STR loci with high accuracy across diverse populations of the plant Arabidopsis thaliana. We show that putatively functional STRs may be identified by their deviation from predicted STR variation and by their association with quantitative phenotypes. Using DNA mixing experiments and a mutant deficient in DNA repair, we demonstrate that MIPSTR can detect low-frequency somatic STR variants. MIPSTR is applicable to any organism with a high-quality reference genome and is scalable to genotyping many thousands of STR loci in thousands of individuals.
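The role of single-molecule information in separating stutter from true variation can be illustrated with a minimal sketch. It is not the MIPSTR pipeline: it assumes each read has already been reduced to a pair of (molecular tag, STR repeat-unit count) for one locus, and the function names, the minimum read count per tag, and the majority threshold are all hypothetical choices made for illustration. The idea is that reads sharing a tag derive from one captured molecule, so disagreement within a tag group is likely technical, whereas a minority allele supported by its own tag groups is a candidate somatic variant.

```python
from collections import Counter, defaultdict


def consensus_per_molecule(reads, min_reads=3, min_fraction=0.5):
    """Collapse reads into one allele call per molecular tag.

    reads: iterable of (tag, repeat_units) pairs for a single STR locus.
    min_reads and min_fraction are illustrative thresholds, not values
    taken from the paper.
    """
    by_tag = defaultdict(list)
    for tag, units in reads:
        by_tag[tag].append(units)

    calls = {}
    for tag, units_list in by_tag.items():
        if len(units_list) < min_reads:
            continue  # too few reads to form a consensus for this molecule
        allele, count = Counter(units_list).most_common(1)[0]
        # Keep the call only if a strict majority of reads agree;
        # minority lengths within a tag group are treated as stutter.
        if count / len(units_list) > min_fraction:
            calls[tag] = allele
    return calls


def allele_frequencies(calls):
    """Fraction of captured molecules supporting each allele."""
    counts = Counter(calls.values())
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Toy data: tags are placeholder strings, counts are repeat units.
reads = (
    [("AAC", 12)] * 4 + [("AAC", 11)]    # one stutter read among five
    + [("GTT", 12)] * 3
    + [("CCA", 13)] * 3 + [("CCA", 12)]  # molecule carrying a 13-unit allele
    + [("TTG", 12), ("TTG", 11)]         # dropped: fewer than min_reads
)
calls = consensus_per_molecule(reads)
freqs = allele_frequencies(calls)
```

In this toy input the 11-unit read under tag `AAC` is discarded as within-molecule noise, while the 13-unit allele survives because a whole tag group supports it. A real analysis would also need to model sequencing error in the tags themselves and errors introduced before tagging, which this sketch ignores.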
