Badread: simulation of error-prone long reads

DNA sequencing platforms aim to measure the sequence of nucleotides (A, C, G and T) in a sample of DNA. Sequencers made by Illumina have been the dominant technology for much of the past decade, but they generate fragments of sequence (‘reads’) that are relatively short (~100–300 nucleotides in length). In contrast, Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) produce ‘long-read’ sequencers that can generate reads tens of thousands of nucleotides long or more (Eisenstein, 2017). Long reads from these platforms can greatly benefit genome assembly and other bioinformatic analyses (Koren, Walenz, Berlin, Miller, & Phillippy, 2017; Phillippy, 2017). ONT and PacBio sequencers achieve their long read lengths because they detect nucleotides in individual molecules of DNA, an approach known as single-molecule sequencing (Heather & Chain, 2016). However, the stochastic nature of measurement at the single-molecule scale means that ONT and PacBio reads are ‘noisy’: they contain a substantial number of errors.