NeatFreq: reference-free data reduction and coverage normalization for de novo sequence assembly

Background: Deep shotgun sequencing on next-generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and the high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence-independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.

Results: Here we introduce NeatFreq, a software tool that reduces a data set to more uniform coverage by clustering and selecting from reads binned by their median k-mer frequency (RMKF) and uniqueness. Previous algorithms normalize read coverage based on RMKF, but do not include methods for the preferred selection of (1) extremely low coverage regions produced by extremely variable sequencing of random-primed products and (2) 2-sided paired-end sequences. The algorithm increases the incorporation of the most unique, lowest-coverage segments of a genome using an error-corrected data set. NeatFreq was applied to bacterial, viral plaque, and single-cell sequencing data. The algorithm showed an increase in the rate at which the most unique reads in a genome were included in the assembled consensus, while also reducing the count of duplicative and erroneous contigs (strings of high confidence overlaps) in the deliverable consensus. The results obtained from conventional Overlap-Layout-Consensus (OLC) assembly were compared to simulated multi-de Bruijn graph assembly alternatives trained for variable coverage input, using sequence data before and after normalization of coverage. Coverage reduction was shown to increase processing speed and reduce memory requirements when using conventional bacterial assembly algorithms.

Conclusions: The normalization of deep coverage spikes, which would otherwise inhibit consensus resolution, enables High Throughput Sequencing (HTS) assembly projects to consistently run to completion with existing assembly software. The NeatFreq software package is free, open source and available at https://github.com/bioh4x/NeatFreq.
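
To make the idea of binning reads by median k-mer frequency concrete, the following is a minimal illustrative sketch in Python. It is not the NeatFreq implementation: the function names (`kmers`, `read_median_kmer_freq`, `normalize`), the parameters `k`, `target`, and `seed`, and the proportional down-sampling rule are all assumptions made for illustration. It also leaves out the uniqueness-based selection and paired-end handling described in the abstract. It only shows the general pattern of keeping every read from low-frequency bins while sampling a fraction of reads from high-frequency bins.

```python
from collections import Counter, defaultdict
from statistics import median
import random


def kmers(seq, k):
    """Return all overlapping k-mers of a read (empty if the read is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_median_kmer_freq(reads, k):
    """Compute, for each read, the median count of its k-mers across the whole data set."""
    counts = Counter(km for r in reads for km in kmers(r, k))
    result = []
    for r in reads:
        kms = kmers(r, k)
        # Reads too short to contain a k-mer get frequency 0 (an assumption of this sketch).
        result.append(median(counts[km] for km in kms) if kms else 0)
    return result


def normalize(reads, k=5, target=3, seed=0):
    """Hypothetical normalization: keep all reads in bins at or below `target`,
    and randomly sample higher bins down in proportion to target / bin frequency."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, freq in enumerate(read_median_kmer_freq(reads, k)):
        bins[int(freq)].append(idx)

    kept = []
    for freq, members in bins.items():
        if freq <= target:
            # Low-coverage bin: retain every read.
            kept.extend(members)
        else:
            # High-coverage bin: retain roughly target/freq of the reads, at least one.
            n = max(1, round(len(members) * target / freq))
            kept.extend(rng.sample(members, n))
    # Preserve the original read order in the output.
    return [reads[i] for i in sorted(kept)]
```

Under these assumptions, calling `normalize(reads, k=5, target=3)` on a data set in which one region is covered far more deeply than another returns a subset in which the deep region is thinned toward the target while reads from the sparse region are all retained.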
