Comparison of Metatranscriptomic Samples Based on k-Tuple Frequencies

Background The comparison of samples, or beta diversity, is one of the essential problems in ecological studies. Next generation sequencing (NGS) technologies make it possible to obtain large amounts of metagenomic and metatranscriptomic short read sequences across many microbial communities. De novo assembly of the short reads can be especially challenging because the number of genomes and their sequences are generally unknown and the coverage of each genome can be very low, so traditional alignment-based sequence comparison methods cannot be used. Alignment-free approaches based on k-tuple frequencies, on the other hand, have yielded promising results for the comparison of metagenomic samples. However, it is not known whether these approaches can be used to compare metatranscriptomic datasets, or which dissimilarity measures perform best.
Results We applied several beta diversity measures based on k-tuple frequencies to real metatranscriptomic datasets from the 454 pyrosequencing and Illumina sequencing platforms to evaluate their effectiveness for clustering metatranscriptomic samples. The measures comprised three d2-type dissimilarity measures, one dissimilarity measure from CVTree, one relative entropy based measure S2, and three classical distances. The results showed that the d2S measure achieves superior performance in clustering metatranscriptomic samples into different groups under different sequencing depths for both 454 and Illumina datasets, in recovering environmental gradients affecting microbial samples, and in classifying coexisting metagenomic and metatranscriptomic datasets, and that it is robust to sequencing errors. We also investigated the effects of tuple size and the order of the background Markov model. A software pipeline implementing all steps of the analysis is available at http://code.google.com/p/d2-tools/.
Conclusions The k-tuple based sequence signature measures can effectively reveal major groups and gradient variation among metatranscriptomic samples from NGS reads. The d2S dissimilarity measure performs well in all application scenarios, and its performance is robust with respect to tuple size and the order of the Markov model.
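
The abstract does not spell out how a k-tuple based dissimilarity is computed, so the following is only a minimal sketch of one such measure. It assumes the d2S-style definition commonly given in the alignment-free sequence comparison literature: k-tuple counts are centered by their expectation under a background Markov model and then compared through a normalized inner product. For brevity it assumes an order-0 background (independent bases) estimated from each sample's own reads, and it ignores reverse complements. All function names are hypothetical, and this is not the d2-tools pipeline.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

ALPHABET = "ACGT"


def count_ktuples(reads, k):
    """Count every k-tuple over all reads, skipping tuples with non-ACGT symbols."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def base_frequencies(reads):
    """Estimate an order-0 background model (single-base frequencies)."""
    base_counts = Counter(b for read in reads for b in read if b in ALPHABET)
    total = sum(base_counts.values())
    return {b: base_counts[b] / total for b in ALPHABET}


def centered_counts(reads, k):
    """Observed k-tuple count minus its expected count under the background."""
    counts = count_ktuples(reads, k)
    n = sum(counts.values())
    freq = base_frequencies(reads)
    centered = {}
    for letters in product(ALPHABET, repeat=k):
        word = "".join(letters)
        expected = n * prod(freq[b] for b in word)
        centered[word] = counts[word] - expected
    return centered


def d2s_dissimilarity(reads_x, reads_y, k):
    """d2S-style dissimilarity in [0, 1]; 0 means identical centered profiles."""
    x = centered_counts(reads_x, k)
    y = centered_counts(reads_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        scale = sqrt(x[word] ** 2 + y[word] ** 2)
        if scale == 0.0:
            continue  # this tuple carries no signal in either sample
        num += x[word] * y[word] / scale
        norm_x += x[word] ** 2 / scale
        norm_y += y[word] ** 2 / scale
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # assumption: treat an undefined ratio as "uncorrelated"
    return 0.5 * (1.0 - num / (sqrt(norm_x) * sqrt(norm_y)))


if __name__ == "__main__":
    sample_a = ["ACGTACGTAGCTAGCTAGGATCC", "TTGACCAGTACGATCGATCG"]
    sample_b = ["GGGCCCGGGCGCGCGGCCGCGG", "CCGCGGGCCCGCGGGCGC"]
    print(d2s_dissimilarity(sample_a, sample_b, k=3))
```

Computing this value for every pair of samples would give a dissimilarity matrix that can be passed to a standard clustering or ordination routine. A higher-order background model would replace the product of base frequencies with transition probabilities estimated from the reads.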
