End-to-End Optimization of High-Throughput DNA Sequencing

At the core of Illumina's high-throughput DNA sequencing platforms lies bridge amplification, a biophysical surface process that produces a random geometry of clusters of homogeneous short DNA fragments, typically hundreds of base pairs long. The statistical properties of this random process and the lengths of the fragments are critical because they determine the information that can subsequently be extracted, namely the density of successfully inferred DNA fragment reads. The ensembles of overlapping DNA fragment reads are then used to computationally reconstruct the much longer target genome sequence. The success of the reconstruction in turn depends on having a sufficiently large ensemble of sufficiently long DNA fragments. In this article, we use stochastic geometry to model and optimize the end-to-end flow cell synthesis and target genome sequencing process, linking the statistics of the physical processes, which can be partially controlled, to the success of the final computational step. Based on a rough calibration of our model, we provide, for the first time, a mathematical framework that captures the salient features of the sequencing platform and serves as a basis for optimizing cost and performance and for analyzing sensitivity to various parameters.
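To make the dependence on read count and read length concrete, the following minimal sketch uses the classical Poisson coverage approximation for shotgun sequencing (reads placed uniformly at random on the genome). It is not the stochastic-geometry model developed in this article, and it ignores base-calling errors, repeats, and minimum-overlap requirements. All function names and the example numbers are illustrative assumptions.

```python
import math


def coverage_depth(num_reads, read_length, genome_length):
    """Average number of reads covering a base: c = N * L / G."""
    return num_reads * read_length / genome_length


def expected_fraction_covered(num_reads, read_length, genome_length):
    """Expected fraction of bases covered by at least one read.

    Assumes reads start at uniformly random positions (Poisson
    approximation), so a given base is missed with probability exp(-c).
    """
    c = coverage_depth(num_reads, read_length, genome_length)
    return 1.0 - math.exp(-c)


def reads_needed(read_length, genome_length, target_fraction):
    """Smallest read count whose expected covered fraction meets the target."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must lie strictly between 0 and 1")
    required_depth = -math.log(1.0 - target_fraction)
    return math.ceil(required_depth * genome_length / read_length)


if __name__ == "__main__":
    # Hypothetical toy numbers, chosen only for illustration.
    genome_length = 10_000
    for read_length in (50, 100, 200):
        n = reads_needed(read_length, genome_length, 0.99)
        frac = expected_fraction_covered(n, read_length, genome_length)
        print(f"L={read_length}: N={n}, expected coverage={frac:.4f}")
```

Under this approximation, doubling the read length halves the number of reads needed for the same expected coverage, which is the kind of trade-off between cluster density and fragment length that the end-to-end model is meant to quantify more faithfully.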
