An Efficient Approach to Merging Paired-End Reads and Incorporation of Uncertainties

Next-Generation Sequencing (NGS) technologies have reshaped the landscape of life sciences. The massive amount of data generated by NGS is rapidly transforming biological research from traditional wet-lab work into a data-intensive analytical discipline (Koboldt et al., Cell 155(1):27–38, 2013). The Illumina “sequencing by synthesis” technique (Mardis, Annu Rev Genomics Hum Genet 9:387–402, 2008) is one of the most popular and widely used NGS technologies.

[1] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins. 1970, Journal of molecular biology.

[2] Richard W. Hamming, et al. Error detecting and error correcting codes. 1950.

[3] O. Gotoh. An improved algorithm for matching biological sequences. 1982, Journal of molecular biology.

[4] Konrad H. Paszkiewicz, et al. De novo assembly of short sequence reads. 2010, Briefings Bioinform.

[5] Robert C. Edgar, et al. Error filtering, pair assembly and error correction for next-generation sequencing reads. 2015, Bioinform.

[6] Steven Salzberg, et al. BIOINFORMATICS ORIGINAL PAPER. 2004.

[7] E. Mardis. Next-generation DNA sequencing methods. 2008, Annual review of genomics and human genetics.

[8] Jiajie Zhang, et al. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. 2013, Bioinform.

[9] R. Wilson, et al. The Next-Generation Sequencing Revolution and Its Impact on Genomics. 2013, Cell.

[10] Ben Nichols, et al. VSEARCH: a versatile open source tool for metagenomics. 2022.

[11] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals. 1965.

[12] Steven L. Salzberg, et al. Fast gapped-read alignment with Bowtie 2. 2012, Nature Methods.

[13] P. Green, et al. Base-calling of automated sequencer traces using phred. II. Error probabilities. 1998, Genome research.

[14] M. S. Waterman, et al. Identification of common molecular subsequences. 1981, Journal of molecular biology.

[15] Margaret C. Linak, et al. Sequence-specific error profile of Illumina sequencers. 2011, Nucleic acids research.

[16] Torbjørn Rognes, et al. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. 2000, Bioinform.

[17] Daniel G. Brown, et al. PANDAseq: paired-end assembler for illumina sequences. 2012, BMC Bioinformatics.

[18] S. F. Altschul, et al. Local alignment statistics. 1996, Methods in enzymology.

[19] H. Swerdlow, et al. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. 2012, BMC Genomics.