IDBA-MTP: A Hybrid MetaTranscriptomic Assembler Based on Protein Information

Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study a microbial community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, de novo assembly of the mRNAs would seem to be needed. However, NGS reads are short, and mRNAs share many similar regions and differ tremendously in abundance levels, which makes de novo assembly challenging. The existing assembler IDBA-MT, designed specifically for metatranscriptomic data, performs well only on highly expressed mRNAs. This paper introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that exploits a database of millions of known protein sequences associated with mRNAs. Using the protein information effectively is non-trivial, both because of the size of the database and because different mRNAs might lead to proteins with similar functions, since different amino acids might have similar characteristics. IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques, and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs. Availability: www.cs.hku.hk/~alse/hkubrg/.
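To make the idea of a similarity measure between an mRNA and a protein concrete, the following is a minimal illustrative sketch and not the IDBA-MTP implementation. It assumes the standard genetic code, translates the mRNA in three forward reading frames, and scores each translation against the protein with a local-alignment dynamic program. The function names, the amino acid groups, and the scoring values (+2 identical, +1 same group, -1 otherwise, -2 per gap) are hypothetical choices for illustration; a real tool would more likely use a substitution matrix such as BLOSUM62.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical groups of amino acids with similar characteristics
# (a toy stand-in for a real substitution matrix).
GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ"]


def translate(mrna, frame=0):
    """Translate an mRNA (or cDNA) string in the given forward frame."""
    seq = mrna.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks an unknown codon
        for i in range(frame, len(seq) - 2, 3)
    )


def score_pair(a, b):
    """Toy substitution score: identical, same group, or unrelated."""
    if a == b:
        return 2
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -1


def local_alignment_score(p, q, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(q) + 1)
    best = 0
    for a in p:
        cur = [0]
        for j, b in enumerate(q, 1):
            s = max(0,
                    prev[j - 1] + score_pair(a, b),  # align a with b
                    prev[j] + gap,                   # gap in q
                    cur[j - 1] + gap)                # gap in p
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best


def best_frame_score(mrna, protein):
    """Highest local alignment score over the three forward frames."""
    return max(local_alignment_score(translate(mrna, f), protein)
               for f in range(3))
```

Under these assumptions, `best_frame_score("ATGAAAGTT", "MKI")` rewards the conservative V/I substitution rather than penalizing it, which is the kind of tolerance the abstract motivates. The quadratic cost of the dynamic program is why a seed-and-extend heuristic is needed before alignment when the protein database is large.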
