A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes have enabled investigation of a wide range of organisms and ecosystems. However, sampling variation in short-read data sets and the high error rates of modern sequencers present many new computational challenges in data interpretation. These challenges have led to the development of new classes of mapping tools and de novo assemblers, which are in turn strained by the continued improvement in sequencing throughput. Here we describe digital normalization, a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors. Digital normalization substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting the content of the generated contigs. We apply digital normalization to the assembly of microbial genomic data, amplified single-cell genomic data, and transcriptomic data. Our implementation is freely available for use and modification.
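
The single-pass, reference-free idea can be illustrated with a minimal sketch. The abstract does not give the algorithm's details, so everything specific here is an assumption made for illustration: the coverage of each read is estimated as the median abundance of its k-mers among the reads kept so far, the read is kept only if that estimate is below a cutoff, and the names `normalize`, `kmers`, `k`, and `cutoff` are hypothetical. An exact in-memory dictionary stands in for whatever counting structure the real implementation uses.

```python
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def normalize(reads, k=20, cutoff=20):
    """Single pass over reads; keep a read only while its estimated
    coverage is below cutoff. k and cutoff are illustrative defaults,
    not values taken from the abstract."""
    counts = {}   # exact k-mer counts (assumption: a real tool may use
                  # a fixed-memory approximate counter instead)
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, so no estimate is possible
        # Assumed estimator: median abundance of the read's k-mers
        # among the reads retained so far.
        estimate = median(counts.get(km, 0) for km in read_kmers)
        if estimate < cutoff:
            kept.append(read)
            # Only retained reads contribute to the counts.
            for km in read_kmers:
                counts[km] = counts.get(km, 0) + 1
    return kept


if __name__ == "__main__":
    reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
    for read in normalize(reads, k=4, cutoff=3):
        print(read)
```

In this toy input, the redundant copies of the first read are discarded once its estimated coverage reaches the cutoff, while the single low-coverage read is retained. No reference genome is consulted at any point, and each read is examined exactly once.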
