A meta-genome sequencing and assembly preprocessing algorithm inspired by restriction site base composition

Motivation: In meta-genome sequencing and assembly projects, contigs from different organisms are mixed together in a single pool, which makes assembling the genome of each organism a complex and challenging problem. It is therefore desirable to sort the contigs by origin into separate bins before assembly. We propose a framework that uses the base compositions of bacterial restriction sites to generate sets of motifs that differentiate organismal groups, and hence the contigs derived from those groups. We introduce spectrum sets and show how to select them strategically for binning contigs from different organisms. We suggest that this framework can save time during a meta-genome sequencing and assembly project.

Results: Our method is able to differentiate organisms and to correctly associate contigs with the organism from which they were derived. In particular, we show that two genera are fundamentally different by analyzing their motif proportions. Using one of the four spectrum sets, which together encompass all known restriction sites, we show that different sets differ in their ability to distinguish sequences. In addition, we show that selecting a spectrum set that is relevant to one organism but not the other greatly improves differentiation performance, even when the contigs are short (1000 bp).

Conclusions: Using ten trials of newly selected contigs to confirm our premise, our study provides a proof of concept for a novel and computationally effective preprocessing step for meta-genome sequencing and assembly tasks.
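The idea of comparing motif proportions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two-motif set (the EcoRI site GAATTC and the BamHI site GGATCC), the reference profiles, and the L1 distance used for assignment are all assumptions chosen for illustration, and the abstract does not specify how spectrum sets are built or how contigs are scored.

```python
def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def motif_proportions(seq, motifs):
    """Return each motif's share of all motif hits found in seq."""
    seq = seq.upper()
    counts = {m: count_overlapping(seq, m) for m in motifs}
    total = sum(counts.values())
    if total == 0:
        # No motif occurs in the sequence: report an all-zero profile.
        return {m: 0.0 for m in motifs}
    return {m: c / total for m, c in counts.items()}


def assign_bin(contig, references, motifs):
    """Assign a contig to the reference profile at the smallest L1 distance.

    `references` maps a bin name to a motif-proportion profile.
    The L1 distance is an assumption made for this sketch.
    """
    profile = motif_proportions(contig, motifs)

    def distance(name):
        ref = references[name]
        return sum(abs(profile[m] - ref[m]) for m in motifs)

    return min(references, key=distance)


# Hypothetical motif set and reference profiles, for illustration only.
MOTIFS = ["GAATTC", "GGATCC"]
REFERENCES = {
    "genus_A": {"GAATTC": 1.0, "GGATCC": 0.0},
    "genus_B": {"GAATTC": 0.0, "GGATCC": 1.0},
}

if __name__ == "__main__":
    contig = "GAATTCGAATTCGGATCC"
    print(motif_proportions(contig, MOTIFS))
    print(assign_bin(contig, REFERENCES, MOTIFS))
```

In practice the reference profiles would be estimated from known genomes, and the motif set would be chosen to be relevant to one organism but not the other, as the abstract describes.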
