Fast and simple protein-alignment-guided assembly of orthologous gene families from microbiome sequencing reads

Background: Microbiome sequencing projects typically collect tens of millions of short reads per sample. Depending on the goals of the project, the short reads can either be subjected to direct sequence analysis or be assembled into longer contigs. The assembly of whole genomes from metagenomic sequencing reads is a very difficult problem. However, for some questions, only specific genes of interest need to be assembled. This is then a gene-centric assembly where the goal is to assemble reads into contigs for a family of orthologous genes.

Methods: We present a new method for performing gene-centric assembly, called protein-alignment-guided assembly, and provide an implementation in our metagenome analysis tool MEGAN. Genes are assembled on the fly, based on the alignment of all reads against a protein reference database such as NCBI-nr. Specifically, the user selects a gene family based on a classification such as KEGG and all reads binned to that gene family are assembled.

Results: Using published synthetic community metagenome sequencing reads and a set of 41 gene families, we show that the performance of this approach compares favorably with that of full-featured assemblers and that of a recently published HMM-based gene-centric assembler, both in terms of the number of reference genes detected and of the percentage of reference sequence covered.

Conclusions: Protein-alignment-guided assembly of orthologous gene families complements whole-metagenome assembly in a new and very useful way.
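The idea described under Methods (select the reads binned to one gene family, then use their protein alignments to guide assembly) can be illustrated with a small Python sketch. This is not MEGAN's implementation. The names `AlignedRead`, `reads_for_family`, `merge_if_consistent` and `assemble_family`, the `min_overlap` parameter, and the family labels are all hypothetical. The sketch also makes a strong simplifying assumption: each read is already oriented, starts at its first aligned codon, and aligns to the reference without indels, so that the nucleotide offset between two reads is exactly three times the difference of their amino-acid start positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedRead:
    """A read plus the one protein alignment used to place it (hypothetical type)."""
    read_id: str
    seq: str        # DNA, oriented, beginning at the first aligned codon (assumption)
    ref_id: str     # identifier of the protein reference that was hit
    ref_start: int  # 0-based amino-acid position of the alignment start on the reference


def reads_for_family(reads, family_of_ref, family):
    """Keep only reads whose reference protein is binned to the chosen gene family."""
    return [r for r in reads if family_of_ref.get(r.ref_id) == family]


def merge_if_consistent(contig, contig_start, read, min_overlap):
    """Extend contig with read if the alignment-implied overlap matches exactly.

    Returns the (possibly unchanged) contig, or None if the read cannot be merged.
    """
    offset = 3 * (read.ref_start - contig_start)  # expected nucleotide offset
    if offset < 0 or offset > len(contig):
        return None
    overlap = min(len(contig) - offset, len(read.seq))
    if overlap < min_overlap:
        return None
    if contig[offset:offset + overlap] != read.seq[:overlap]:
        return None
    return contig + read.seq[overlap:]


def assemble_family(reads, family_of_ref, family, min_overlap=20):
    """Greedy, per-reference assembly of all reads binned to one gene family."""
    by_ref = {}
    for r in reads_for_family(reads, family_of_ref, family):
        by_ref.setdefault(r.ref_id, []).append(r)

    contigs = []
    for group in by_ref.values():
        group.sort(key=lambda r: r.ref_start)  # walk along the reference protein
        current, start = None, 0
        for r in group:
            if current is not None:
                merged = merge_if_consistent(current, start, r, min_overlap)
                if merged is not None:
                    current = merged
                    continue
                contigs.append(current)  # no consistent overlap: close this contig
            current, start = r.seq, r.ref_start
        if current is not None:
            contigs.append(current)
    return contigs
```

The alignment coordinates restrict overlap checks to read pairs that the reference suggests should overlap, so no all-against-all comparison is needed. A realistic implementation would additionally have to handle indels, sequencing errors, reverse-complement reads, and reads aligned to different reference proteins of the same family.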
