Compareads: comparing huge metagenomic experiments

Background: Metagenomic samples are currently analyzed mainly by comparing them with a priori knowledge stored in data banks. While powerful, such approaches cannot exploit unknown and/or "unculturable" species, which are estimated, for instance, at 99% of Bacteria.

Methods: This work introduces Compareads, a de novo comparative metagenomic approach that returns the reads that are similar between two, possibly metagenomic, datasets generated by high-throughput sequencers. One original feature of this work is its ability to deal with huge datasets. The second main contribution of this paper is the design of a probabilistic data structure, based on Bloom filters, that indexes millions of reads with a limited memory footprint and a controlled error rate.

Results: We show that Compareads retrieves biological information while scaling to huge datasets. Its time and memory requirements make Compareads usable on read sets each composed of more than 100 million Illumina reads, in a few hours and using 4 GB of memory, and thus usable on today's personal computers.

Conclusion: Using a new data structure, Compareads is a practical solution for the de novo comparison of huge metagenomic samples. Compareads is released under the CeCILL license and can be freely downloaded from http://alcovna.genouest.org/compareads/.
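To make the idea of a Bloom-filter index over reads concrete, here is a minimal Python sketch. It is not the data structure implemented in Compareads, whose details the abstract does not give. It is a plain textbook Bloom filter, and every name and parameter in it (`BloomFilter`, `similar_reads`, `k`, `min_shared`, `n_bits`, `n_hashes`) is an assumption made for illustration. It indexes the k-mers of one read set and reports the reads of the other set that share at least `min_shared` k-mers with the index.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive n_hashes bit positions from one 128-bit digest.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # odd step, never zero
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(read, k):
    """Yield every overlapping substring of length k of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def expected_false_positive_rate(n_bits, n_hashes, n_items):
    """Standard approximation (1 - e^(-hn/m))^h for m bits, h hashes, n items."""
    return (1.0 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes


def similar_reads(reads_a, reads_b, k=8, min_shared=2, n_bits=1 << 20, n_hashes=4):
    """Return reads of reads_a sharing at least min_shared k-mers with reads_b.

    Hypothetical similarity criterion chosen for this sketch only.
    """
    index = BloomFilter(n_bits, n_hashes)
    for read in reads_b:
        for kmer in kmers(read, k):
            index.add(kmer)
    return [
        read for read in reads_a
        if sum(kmer in index for kmer in kmers(read, k)) >= min_shared
    ]


reads_a = ["ACGTACGTAGCTAGCTAG", "TTTTTTTTTTTTTTTTTT"]
reads_b = ["GGACGTACGTAGCTAGCTAGGG"]
shared = similar_reads(reads_a, reads_b)
```

The memory footprint is fixed by `n_bits` regardless of read length, and the error rate is controlled through the ratio of bits to indexed k-mers: `expected_false_positive_rate` gives the usual estimate, so a larger filter or fewer indexed items lowers the chance that an absent k-mer is reported as present.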
