Decoding the link of microbiome niches with homologous sequences enables accurately targeted protein structure prediction

Significance

Metagenome sequencing provides a useful repository from which to extract evolutionary information and assist protein structure prediction. The sequence-search process, however, becomes increasingly prohibitive as the library grows. We hypothesize that there are inherent evolutionary linkages between microbial niches and protein families that can be used to construct precise multiple sequence alignments (MSAs). To examine the hypothesis, we built a model library of four major biomes containing 4.25 billion sequences. Large-scale protein folding experiments revealed that MSAs collected from individually linked microbiomes generate more accurate contact and structure models than those from all microbiome sequences, while using significantly fewer computing resources. These results demonstrate the potential to solve the metagenome-search problem using a microbiome-targeted approach.

Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template-free protein structure modeling. However, most deep learning–based modeling studies rely on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome-targeted approach, with multiple sequence alignments constructed from individual MetaSource biomes, requires more than threefold less computer memory and CPU (central processing unit) time, yet generates contact-map and three-dimensional structure models with significantly higher accuracy than an approach using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly increasing metagenome databases and the limited computing resources for efficient genome-wide database mining, which provides a useful bluebook to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.
