DINTD: Detection and Inference of Tandem Duplications From Short Sequencing Reads

Tandem duplication (TD) is an important type of structural variation (SV) in the human genome and has biological significance for human cancer evolution and tumorigenesis. Accurate and reliable detection of TDs plays an important role in advancing early detection, diagnosis, and treatment of disease. The advent of next-generation sequencing technologies has made it possible to study TDs. However, detection remains challenging because of the uneven distribution of reads and the uncertain amplitude of TD regions. In this paper, we present a new method, DINTD (Detection and INference of Tandem Duplications), to detect and infer TDs from short sequencing reads. The method first extracts read depth and mapping quality signals and then uses the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to find possible TD regions. A total variation penalized least squares model is fitted to the read depth and mapping quality signals to denoise them, and a 2D binary search tree is used to search for neighboring points efficiently. To further identify the exact breakpoints of the TD regions, split-read signals are integrated into DINTD. Experiments on simulated data sets show that DINTD can outperform other methods in sensitivity, precision, F1-score, and boundary bias. DINTD is further validated on real samples, where its results are consistent with those of other methods. This study indicates that DINTD can be used as an effective tool for detecting TDs.
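The clustering step described above can be illustrated with a minimal sketch: density-based clustering of 2D points (read depth, mapping quality), with neighbor lookups served by a 2D binary search tree. This is not the DINTD implementation. The function names (`build_kdtree`, `radius_query`, `dbscan`), the toy per-bin values, and the parameters `eps` and `min_pts` are all assumptions chosen for illustration, and the abstract does not say how cluster labels are mapped to candidate TD regions.

```python
from collections import deque


def build_kdtree(points, idxs=None, depth=0):
    """Build a 2D binary search tree over point indices (hypothetical helper).

    Each node is (point_index, split_axis, left_subtree, right_subtree).
    """
    if idxs is None:
        idxs = list(range(len(points)))
    if not idxs:
        return None
    axis = depth % 2  # alternate between the two coordinates
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2
    return (idxs[mid], axis,
            build_kdtree(points, idxs[:mid], depth + 1),
            build_kdtree(points, idxs[mid + 1:], depth + 1))


def radius_query(node, points, q, eps, out):
    """Append to `out` the indices of all points within distance eps of q."""
    if node is None:
        return
    i, axis, left, right = node
    p = points[i]
    if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps * eps:
        out.append(i)
    diff = q[axis] - p[axis]
    # Descend only into subtrees that can intersect the query disc.
    if diff <= eps:
        radius_query(left, points, q, eps, out)
    if diff >= -eps:
        radius_query(right, points, q, eps, out)


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    tree = build_kdtree(points)
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = []
        radius_query(tree, points, points[i], eps, neighbors)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = []
            radius_query(tree, points, points[j], eps, nb)
            if len(nb) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nb)
    return labels


# Toy per-bin (normalized read depth, normalized mapping quality) values.
# These numbers are invented for illustration only.
background = [(1.00, 0.98), (1.02, 0.97), (0.98, 0.99), (1.01, 0.96),
              (0.99, 0.98), (1.03, 0.99), (0.97, 0.97), (1.00, 0.95)]
elevated = [(1.52, 0.97), (1.49, 0.98), (1.51, 0.96), (1.50, 0.99)]
isolated = [(2.40, 0.30)]
points = background + elevated + isolated

labels = dbscan(points, eps=0.1, min_pts=3)
print(labels)
```

In this toy input the background bins and the bins with elevated depth form two separate dense groups, while the isolated bin is labeled as noise. A tree-based radius query avoids comparing every pair of bins, which is presumably the motivation for the search tree mentioned in the abstract.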
