When Less is More: “Slicing” Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality

Since the invention of DNA sequencing in the 1970s, computational biologists have had to deal with the problem of de novo genome assembly with limited (or insufficient) depth of sequencing. In this work, we investigate for the first time the opposite problem, that is, the challenge of dealing with excessive depth of sequencing. Specifically, we explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to BAC clones (in the context of the combinatorial pooling design proposed in [1]), and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, we show that when the depth of sequencing increases beyond a certain threshold, sequencing errors make these two problems progressively harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades as more data is added. For the first problem, we propose an effective solution based on “divide and conquer”: we “slice” a large dataset into smaller samples of optimal size, decode each slice independently, then merge the results. Experimental results on over 15,000 barley BACs and over 4,000 cowpea BACs demonstrate a significant improvement in the quality of the decoding and the final assembly. For the second problem, we show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data.
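The slice, decode, and merge strategy described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `slice_reads` and `decode_in_slices`, the random shuffling, the fixed `slice_size` parameter, and the merge by set union are all assumptions made for illustration, and `toy_decoder` is a hypothetical stand-in for the real pooling-based read decoder, which the abstract does not specify.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Set


def slice_reads(reads: Sequence[str], slice_size: int, seed: int = 0) -> List[List[str]]:
    """Shuffle the reads and partition them into slices of at most slice_size reads."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # random sampling without replacement
    return [shuffled[i:i + slice_size] for i in range(0, len(shuffled), slice_size)]


def decode_in_slices(
    reads: Sequence[str],
    slice_size: int,
    decode_slice: Callable[[List[str]], Dict[str, Set[str]]],
    seed: int = 0,
) -> Dict[str, Set[str]]:
    """Decode each slice independently, then merge the per-BAC read sets.

    decode_slice is a hypothetical callable that maps one slice of reads to
    a dictionary {BAC identifier: set of reads assigned to that BAC}.
    """
    merged: Dict[str, Set[str]] = defaultdict(set)
    for chunk in slice_reads(reads, slice_size, seed):
        for bac, assigned in decode_slice(chunk).items():
            merged[bac].update(assigned)  # merge step: union across slices
    return dict(merged)


def toy_decoder(chunk: List[str]) -> Dict[str, Set[str]]:
    """Hypothetical decoder for the demo: the BAC name is the read's prefix."""
    out: Dict[str, Set[str]] = defaultdict(set)
    for read in chunk:
        out[read.split(":")[0]].add(read)
    return out


if __name__ == "__main__":
    demo_reads = [f"bac{i % 3}:r{i}" for i in range(10)]
    for bac, assigned in sorted(decode_in_slices(demo_reads, 4, toy_decoder).items()):
        print(bac, sorted(assigned))
```

Because every read falls into exactly one slice in this sketch, the merge reduces to a union of the per-BAC read sets. How the optimal slice size is chosen is not covered here; it is simply passed in as a parameter.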

[1]  E. Lander, et al.  Genomic mapping by fingerprinting random clones: a mathematical analysis, 1988, Genomics.

[2]  J. Roach.  Random subcloning, 1995, Genome research.

[3]  C. Soderlund, et al.  Contigs built with fingerprints, markers, and FPC V4.7, 2000, Genome research.

[4]  R. Wing, et al.  A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes, 2000, Theoretical and Applied Genetics.

[5]  H. Shizuya, et al.  Five-color-based high-information-content fingerprinting of bacterial artificial chromosome clones using type IIS restriction endonucleases, 2001, Genomics.

[6]  Carolyn Thomas, et al.  High-throughput fingerprinting of bacterial artificial chromosomes using the snapshot labeling kit and sizing of restriction fragments by capillary electrophoresis, 2003, Genomics.

[7]  Galina Fuks, et al.  Whole-Genome Validation of High-Information-Content Fingerprinting, 2005, Plant Physiology.

[8]  Nicolas Thierry-Mieg, et al.  A new pooling strategy for high-throughput screening: the Shifted Transversal Design, 2006, BMC Bioinformatics.

[9]  Stefano Lonardi, et al.  A compartmentalized approach to the assembly of physical maps, 2007, 2007 IEEE 7th International Symposium on BioInformatics and BioEngineering.

[10]  Stefano Lonardi, et al.  Deconvoluting the BAC-gene relationships using a physical map, 2007, Computational Systems Bioinformatics Conference.

[11]  E. Birney, et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs, 2008, Genome research.

[12]  Douglas R. Smith, et al.  Assembly reconciliation, 2008, Bioinform.

[13]  Stefano Lonardi, et al.  Computing the Minimal Tiling Path from a Physical Map by Integer Linear Programming, 2008, WABI.

[14]  Marcel J. T. Reinders, et al.  Integrating genome assemblies with MAIA, 2010, Bioinform.

[15]  Sergey I. Nikolenko, et al.  SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing, 2012, J. Comput. Biol.

[16]  M. Schatz, et al.  GAGE: A Critical Evaluation of Genome Assemblies and Assembly Algorithms (Supplemental Material), 2008.

[17]  Mihaela M. Martis, et al.  A physical, genetic and functional sequence assembly of the barley genome, 2012.

[18]  Siu-Ming Yiu, et al.  IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth, 2012, Bioinform.

[19]  Alberto Policriti, et al.  GAM-NGS: genomic assemblies merger for next generation sequencing, 2013, BMC Bioinformatics.

[20]  Atri Rudra, et al.  Accurate Decoding of Pooled Sequenced Data Using Compressed Sensing, 2013, WABI.

[21]  Gianfranco Ciardo, et al.  Combinatorial Pooling Enables Selective Sequencing of the Barley Gene Space, 2013, PLoS Comput. Biol.