A crowdsourcing method for correcting sequencing errors for the third-generation sequencing data

The third generation sequencing data exposes great advantage on read length, which extremely benefits the genomic analyses. However, the third generation sequencing data implies error models different from the ones that the second generation data brings. It is suggested to correct sequencing errors, which could significantly reduce false positives in downstream analyses. Existing error correction approaches often suffer accuracy loss when the hybrid reads present diversity or the coverage varies. In this paper, we propose a novel method based on crowdsourcing strategy, which is implemented as CLTC. CLTC is also a hybrid correction algorithm, which consists of four steps. The second generation reads are first collected and mapped to the third generation reads. Then, the base difficult level is defined to describe the diversities on a base among a group of 2nd-generation reads covered it. The capability is evaluated for each 2nd-generation read, which considers the base difficult levels across the read, the consistency among overlapped reads and the mapping quality between the 2nd- and 3rd-generation reads. A heuristic algorithm is designed for the calculation of capabilities. An expectation-maximization algorithm is finally used to compute the corrected result for each base-pair. We test CLTC on different datasets and compare to the existing approaches. The results demonstrate that CLTC is able to achieve higher accuracy and performs faster than the existing ones.

[1]  T Laver,et al.  Assessing the performance of the Oxford Nanopore Technologies MinION , 2015, Biomolecular detection and quantification.

[2]  Jan Schröder,et al.  BIOINFORMATICS ORIGINAL PAPER , 2022 .

[3]  Esko Ukkonen,et al.  Accurate self-correction of errors in long reads using de Bruijn graphs , 2016, Bioinform..

[4]  Jing Xu,et al.  Accurately Estimating Tumor Purity of Samples with High Degree of Heterogeneity from Cancer Sequencing Data , 2017, ICIC.

[5]  Jin Zhang,et al.  An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data , 2012, BMC Bioinformatics.

[6]  Joshua F. McMichael,et al.  Systematic Discovery of Complex Indels in Human Cancers , 2015, Nature medicine.

[7]  Piet Demeester,et al.  Jabba: hybrid error correction for long sequencing reads , 2015, Algorithms for Molecular Biology.

[8]  Thomas Hackl,et al.  proovread: large-scale high-accuracy PacBio correction through iterative short read consensus , 2014, Bioinform..

[9]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[10]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[11]  Joshua F. McMichael,et al.  Age-related cancer mutations associated with clonal hematopoietic expansion , 2014, Nature Medicine.

[12]  Jing Xu,et al.  Identifying Heterogeneity Patterns of Allelic Imbalance on Germline Variants to Infer Clonal Architecture , 2017, ICIC.

[13]  W. Wong,et al.  Improving PacBio Long Read Accuracy by Short Read Alignment , 2012, PloS one.

[14]  Hamid R. Rabiee,et al.  A unified statistical framework for crowd labeling , 2015, Knowledge and Information Systems.

[15]  Li Ding,et al.  Patterns and functional implications of rare germline variants across 12 cancer types , 2015, Nature Communications.

[16]  Glenn Tesler,et al.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory , 2012, BMC Bioinformatics.

[17]  J. Landolin,et al.  Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing , 2014 .

[18]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[19]  M. Schatz,et al.  Hybrid error correction and de novo assembly of single-molecule sequencing reads , 2012, Nature Biotechnology.