TRiCoLOR: tandem repeat profiling using whole-genome long-read sequencing data

Abstract

Background: Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains challenging. Traditional mapping-based approaches using short-read data are severely limited in the size and type of tandem repeats they can resolve, whereas recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution.

Results: We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees.

Conclusions: TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.
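To make the terms "period size" and "repeat multiplicity" concrete, the following toy sketch reports the motif, its start position, and its copy number for the longest exact tandem repeat in a short error-free sequence. It is not TRiCoLOR's algorithm, which must cope with error-prone reads and works per haplotype; the function name `longest_tandem_repeat` and all of its parameters are hypothetical and chosen only for illustration.

```python
import re


def longest_tandem_repeat(seq, min_period=1, max_period=6, min_copies=2):
    """Return (start, motif, copies) for the exact tandem repeat covering
    the most bases in `seq`, or None if no repeat is found.

    Illustrative assumption: the sequence is error-free, so exact string
    matching is enough. Real long reads would need error-tolerant matching.
    """
    best = None
    for period in range(min_period, max_period + 1):
        # Lookahead so that overlapping candidates are all inspected:
        # a motif of `period` bases followed by >= (min_copies - 1) copies.
        pattern = re.compile(
            r"(?=(([ACGT]{%d})\2{%d,}))" % (period, min_copies - 1)
        )
        for match in pattern.finditer(seq):
            span, motif = match.group(1), match.group(2)
            copies = len(span) // period
            # Prefer longer spans; on ties prefer the shorter period,
            # so "AAAA" is reported as 4 x "A" rather than 2 x "AA".
            key = (len(span), -period)
            if best is None or key > best[0]:
                best = (key, match.start(), motif, copies)
    if best is None:
        return None
    _, start, motif, copies = best
    return start, motif, copies


if __name__ == "__main__":
    # Period size 3 (motif "CAG"), multiplicity 4, starting at index 4.
    print(longest_tandem_repeat("ACGTCAGCAGCAGCAGTT"))
```

In this sketch the period size is the motif length and the multiplicity is the number of consecutive copies; a single sequencing error inside the repeat would split it, which is the difficulty the abstract attributes to long-read data.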
