TSSV: a tool for characterization of complex allelic variants in pure and mixed genomes

MOTIVATION: Advances in sequencing technologies and computational algorithms have enabled the study of genomic variants and the dissection of their functional consequences. Despite this progress, current tools fail to reliably detect and characterize more complex allelic variants, such as short tandem repeats (STRs). We developed TSSV as an efficient and sensitive tool to profile all allelic variants present in targeted loci. Because its design requires only two short flanking sequences, TSSV can work without a complete reference sequence to reliably profile highly polymorphic, repetitive or uncharacterized regions.

RESULTS: We show that TSSV can accurately determine allelic STR structures in mixtures with 10% representation of minor alleles and in complex mixtures in which a single STR allele is shared. Furthermore, we show the broader utility of TSSV in two other independent studies: characterizing de novo mutations introduced by transcription activator-like effector nucleases (TALENs) and profiling the noise and systematic errors in an IonTorrent sequencing experiment. TSSV complements existing tools by aiding the study of highly polymorphic and complex regions, and it provides a high-resolution map that can be used in a wide range of applications, from personal genomics to forensic analysis and clinical diagnostics.

AVAILABILITY AND IMPLEMENTATION: We have implemented TSSV as a Python package that can be installed from the command line with the command pip install TSSV. Its source code and documentation are available at https://pypi.python.org/pypi/tssv and http://www.lgtc.nl/tssv.
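The flank-based idea described in the abstract can be illustrated with a minimal Python sketch. This is not TSSV's implementation or API: it assumes exact string matching of the two flanking sequences (a real tool would need to tolerate sequencing errors in the flanks), and the function names extract_allele and profile_locus, the flank sequences and the reads are all hypothetical. It only shows how the sequence between two short flanks can be collected and counted per allele without a full reference.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_allele(read, left_flank, right_flank):
    """Return the sequence between the two flanks, or None if either is missing.

    Assumption: flanks are matched exactly; no mismatches or indels allowed.
    """
    start = read.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end == -1:
        return None
    return read[start:end]


def profile_locus(reads, left_flank, right_flank):
    """Count every distinct allele observed between the flanks.

    Each read is tried in forward orientation first, then as its
    reverse complement, and is counted at most once.
    """
    counts = Counter()
    for read in reads:
        for oriented in (read, reverse_complement(read)):
            allele = extract_allele(oriented, left_flank, right_flank)
            if allele is not None:
                counts[allele] += 1
                break
    return counts


# Hypothetical flanks and reads for a locus with an AGAT repeat.
left, right = "ACGTAC", "TTGCCA"
reads = [
    "GG" + left + "AGAT" * 3 + right + "GG",
    left + "AGAT" * 3 + right,
    reverse_complement(left + "AGAT" * 4 + right),
    "CCCCCCCC",  # read that does not cover the locus
]
for allele, n in profile_locus(reads, left, right).most_common():
    print(allele, n)
```

Reporting every observed sequence with its read count, rather than calling a single genotype, is what makes this kind of profile usable for mixtures: a minor allele appears as a low-count entry that can be compared against the total coverage of the locus.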
