Paper information - IterCluster: a barcode clustering algorithm for long fragment read analysis

IterCluster: a barcode clustering algorithm for long fragment read analysis

Recent advances in long fragment read (LFR) technologies, also known as linked-read or read-cloud technologies, such as single tube long fragment reads (stLFR), 10X Genomics Chromium reads, and TruSeq synthetic long-reads, have enabled efficient haplotyping and genome assembly. However, in the case of stLFR and 10X Genomics Chromium reads, the long fragments of a genome are covered sparsely by the reads in each barcode, and most barcodes are shared by multiple long fragments from different regions, which makes assembly with long-range information inefficient. Methods that address these shortcomings are therefore vital for capitalizing on the additional information these technologies provide. We designed IterCluster, a novel, alignment-free clustering algorithm that groups barcodes from the same target region of a genome. It uses k-mer frequency-based features and a Markov Cluster (MCL) approach to collect enough reads from a target region to ensure sufficient sequencing depth for that region. IterCluster was validated on BGI stLFR and 10X Genomics Chromium read datasets, and achieved higher precision and recall on the BGI stLFR data than on the 10X Genomics Chromium data. In addition, we demonstrated that IterCluster improves de novo assembly results when used with a divide-and-conquer strategy on a human genome data set: scaffold/contig N50 was 13.2 kbp/7.1 kbp before IterCluster and 17.1 kbp/11.9 kbp after. IterCluster provides a new way of determining LFR barcode enrichment and a novel approach to de novo assembly using LFR data. IterCluster is open source and available at https://github.com/JianCong-WENG/IterCluster.
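The general idea of alignment-free barcode clustering can be illustrated with a minimal sketch. This is not the IterCluster implementation: it assumes Jaccard similarity between pooled k-mer sets as a simple stand-in for the k-mer frequency-based features, and a plain textbook MCL loop (expansion followed by inflation). All function names, the parameters `k`, `min_sim`, and `inflation`, and the toy reads are hypothetical.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of k-mers in one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def mcl(m, inflation=2.0, iterations=50):
    """Plain MCL: alternate expansion (squaring) and inflation."""
    n = len(m)
    m = normalize_columns([row[:] for row in m])
    for _ in range(iterations):
        # Expansion: square the matrix to let flow spread along paths.
        m = [[sum(m[i][t] * m[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then re-normalize the columns.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Each row that retains flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


def cluster_barcodes(reads_by_barcode, k=5, min_sim=0.1):
    """Cluster barcodes whose pooled reads share many k-mers."""
    names = sorted(reads_by_barcode)
    sets = {b: set().union(*(kmers(r, k) for r in reads_by_barcode[b]))
            for b in names}
    n = len(names)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # self-loops, as is usual for MCL
    for i, j in combinations(range(n), 2):
        s = jaccard(sets[names[i]], sets[names[j]])
        if s >= min_sim:  # drop weak edges (assumed threshold)
            sim[i][j] = sim[j][i] = s
    return sorted(sorted(names[j] for j in c) for c in mcl(sim))


# Toy data: barcodes A and B sample one region, C and D another.
region1 = "ACGTTGCAAGGCTTAACCGGTATC"
region2 = "TTTGGGCCCAAATCTCTGAGAGTG"
toy = {"A": [region1[:20]], "B": [region1[4:]],
       "C": [region2[:20]], "D": [region2[4:]]}
print(cluster_barcodes(toy))
```

On this toy input the two barcodes drawn from each region end up in the same cluster. Real linked-read data would need far larger k, sketching of the k-mer sets, and handling of barcodes that span several regions, none of which is shown here.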
