Benchmarking of computational error-correction methods for next-generation sequencing via unique molecular identifiers

Background: Recent advances in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error-correction algorithms remains unknown.

Results: In this paper, we evaluate the ability of error-correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error-correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply a high-fidelity sequencing protocol based on unique molecular identifiers (UMIs) to eliminate sequencing errors from both simulated data and raw reads. We then perform a realistic evaluation of error-correction methods.

Conclusions: In terms of accuracy, method performance varies substantially across different types of datasets, with no single method performing best on all types of examined data. We also identify the techniques that offer a good balance between precision and sensitivity.
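As an illustration only, the following minimal Python sketch shows one way the two ideas in the abstract could fit together: collapsing reads that share a UMI into a consensus sequence, and then scoring a corrected read against that error-free reference in terms of precision and sensitivity. It is not the authors' pipeline. The function names `umi_consensus` and `score_correction`, the per-position majority vote, the assumption that reads within a UMI cluster have equal length and are already aligned, and the base-level definitions of true positives, false positives, and false negatives are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Collapse reads sharing a UMI into one consensus sequence.

    `reads` is an iterable of (umi, sequence) pairs. Assumption: all
    sequences in a UMI cluster have equal length and are aligned, so a
    per-position majority vote is meaningful.
    """
    clusters = defaultdict(list)
    for umi, seq in reads:
        clusters[umi].append(seq)

    consensus = {}
    for umi, seqs in clusters.items():
        # zip(*seqs) yields one column of bases per position.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


def score_correction(raw, corrected, truth):
    """Return (precision, sensitivity) for one corrected read.

    Assumed base-level definitions for this sketch:
      TP: an erroneous base that the tool fixed correctly
      FP: a correct base that the tool changed
      FN: an erroneous base left unfixed or fixed to the wrong base
    """
    tp = fp = fn = 0
    for r, c, t in zip(raw, corrected, truth):
        if r != t:          # the raw base was a sequencing error
            if c == t:
                tp += 1
            else:
                fn += 1
        elif c != t:        # the raw base was correct but got changed
            fp += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


if __name__ == "__main__":
    reads = [("AAC", "ACGT"), ("AAC", "ACGT"), ("AAC", "ACTT"), ("GGT", "TTTT")]
    print(umi_consensus(reads))
    print(score_correction(raw="ACTTA", corrected="ACGTC", truth="ACGTA"))
```

In this toy input, the tool fixes the one real error but also alters one correct base, so sensitivity is perfect while precision is not. This is the kind of trade-off the conclusions refer to.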
