NCycDB: a curated integrative database for fast and accurate metagenomic profiling of nitrogen cycling genes

Motivation: The nitrogen (N) cycle is a collection of important biogeochemical pathways in the Earth's ecosystem and has received extensive attention in ecology and environmental studies. Shotgun metagenome sequencing is now widely applied to explore the gene families responsible for N cycle processes. However, applying publicly available orthology databases to profile N cycle gene families in shotgun metagenomes raises several problems: inefficient database searching, non-specific orthology groups, and low coverage of N cycle genes and/or gene (sub)families.

Results: To solve these issues, this study built a manually curated integrative database (NCycDB) for fast and accurate profiling of N cycle gene (sub)families from shotgun metagenome sequencing data. NCycDB contains a total of 68 gene (sub)families and covers eight N cycle processes, with 84 759 and 219 146 representative sequences at 95 and 100% identity cutoffs, respectively. We also identified 1958 homologous orthology groups and included the corresponding sequences in the database to avoid false positive assignments due to ‘small database’ issues. We applied NCycDB to characterize N cycle gene (sub)families in 52 shotgun metagenomes from the Global Ocean Sampling expedition. Further analysis showed that the structure and composition of N cycle gene families were most strongly correlated with latitude and temperature. NCycDB is expected to facilitate N cycle studies via shotgun metagenome sequencing approaches in various environments. The framework developed in this study can serve as a good reference for building similar knowledge‐based functional gene databases for other processes and pathways.

Availability and implementation: NCycDB database files are available at https://github.com/qichao1984/NCyc.

Supplementary information: Supplementary data are available at Bioinformatics online.
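The role of the homologous orthology groups can be illustrated with a minimal Python sketch. It is not the NCycDB pipeline itself: the function name, the input shapes (best-hit pairs, an ID-to-family mapping, a set of homolog IDs) and all sequence IDs are hypothetical and chosen only for illustration. The idea is that a read whose best hit is a homolog sequence is discarded rather than being forced onto the nearest N cycle gene family.

```python
from collections import Counter


def profile_gene_families(hits, seq_to_family, homolog_ids):
    """Count reads per gene family from best-hit pairs (illustrative only).

    hits: iterable of (read_id, subject_id) pairs, ordered so that the
          first pair seen for a read is its best hit (an assumption).
    seq_to_family: dict mapping a database sequence ID to a gene family.
    homolog_ids: set of database sequence IDs that belong to homologous,
                 non-target orthology groups.
    """
    counts = Counter()
    seen_reads = set()
    for read_id, subject_id in hits:
        # Keep only the first (best) hit for each read.
        if read_id in seen_reads:
            continue
        seen_reads.add(read_id)
        # A best hit to a homolog means the read is not an N cycle gene.
        if subject_id in homolog_ids:
            continue
        family = seq_to_family.get(subject_id)
        if family is not None:
            counts[family] += 1
    return counts


# Hypothetical example data; the IDs are made up.
example_hits = [
    ("read1", "seqA"),
    ("read2", "homolog1"),
    ("read3", "seqB"),
    ("read1", "seqB"),  # weaker second hit for read1, ignored
]
example_map = {"seqA": "amoA", "seqB": "nifH"}
example_homologs = {"homolog1"}

family_counts = profile_gene_families(example_hits, example_map, example_homologs)
```

In a database without the homolog sequences, `read2` would have no better match and could be assigned to whichever target family it resembles most, which is the false positive the ‘small database’ remark refers to.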
