Reference-Free Validation of Short Read Data

Background High-throughput DNA sequencing techniques offer the ability to rapidly and cheaply sequence material such as whole genomes. However, the short-read data produced by these techniques can be biased or compromised at several stages in the sequencing process; the sources and properties of some of these biases are not always known. Accurate assessment of bias is required for experimental quality control, genome assembly, and interpretation of coverage results. An additional challenge is that, for new genomes or material from an unidentified source, there may be no reference available against which the reads can be checked. Results We propose analytical methods for identifying biases in a collection of short reads, without recourse to a reference. These, in conjunction with existing approaches, comprise a methodology that can be used to quantify the quality of a set of reads. Our methods involve use of three different measures: analysis of base calls; analysis of k-mers; and analysis of distributions of k-mers. We apply our methodology to wide range of short read data and show that, surprisingly, strong biases appear to be present. These include gross overrepresentation of some poly-base sequences, per-position biases towards some bases, and apparent preferences for some starting positions over others. Conclusions The existence of biases in short read data is known, but they appear to be greater and more diverse than identified in previous literature. Statistical analysis of a set of short reads can help identify issues prior to assembly or resequencing, and should help guide chemical or statistical methods for bias rectification.

[1]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.

[2]  F. Sanger,et al.  DNA sequencing with chain-terminating inhibitors. , 1977, Proceedings of the National Academy of Sciences of the United States of America.

[3]  S. Morishita,et al.  Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. , 2009, Genome research.

[4]  Martin Kircher,et al.  Improved base calling for the Illumina Genome Analyzer using machine learning strategies , 2009, Genome Biology.

[5]  Ioannis Xenarios,et al.  BMC Bioinformatics BioMed Central Methodology article Probabilistic base calling of Solexa sequencing data , 2022 .

[6]  Juliane C. Dohm,et al.  Substantial biases in ultra-short read data sets from high-throughput DNA sequencing , 2008, Nucleic acids research.

[7]  P Green,et al.  Base-calling of automated sequencer traces using phred. II. Error probabilities. , 1998, Genome research.

[8]  David Hernández,et al.  De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. , 2008, Genome research.

[9]  P. Mitra,et al.  Alta-Cyclic: a self-optimizing base caller for next-generation sequencing , 2008, Nature Methods.

[10]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[11]  Timothy B. Stockwell,et al.  Evaluation of next generation sequencing platforms for population targeted sequencing studies , 2009, Genome Biology.

[12]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[13]  Mark J. P. Chaisson,et al.  De novo fragment assembly with short mate-paired reads: Does the read length matter? , 2009, Genome research.

[14]  A. Mortazavi,et al.  Genome-Wide Mapping of in Vivo Protein-DNA Interactions , 2007, Science.

[15]  Jan Schröder,et al.  Genome analysis SHREC : a short-read error correction method , 2009 .

[16]  J. Lupski,et al.  The complete genome of an individual by massively parallel DNA sequencing , 2008, Nature.

[17]  Andreas von Bubnoff,et al.  Next-Generation Sequencing: The Race Is On , 2008, Cell.

[18]  P. Green,et al.  Base-calling of automated sequencer traces using phred. I. Accuracy assessment. , 1998, Genome research.