A machine learning framework for discovery and enrichment of metagenomics metadata from open access publications

Abstract: Metagenomics is a culture-independent method for studying the microbes inhabiting a particular environment. Comparing the functional or taxonomic composition of samples, from either a longitudinal study or cross-sectional studies, can provide clues about how the microbiota has adapted to the environment. However, a recurring challenge, especially when comparing results between independent studies, is that key metadata about the sample and about the molecular methods used to extract and sequence the genetic material are often missing from sequence records, making it difficult to account for confounding factors. These missing metadata can nevertheless often be found in the narrative of the publications describing the research. Here, we describe a machine learning framework that automatically extracts essential metadata for a wide range of metagenomics studies from the literature contained in Europe PMC. This framework has enabled the extraction of metadata from 114,099 publications in Europe PMC, including 19,900 publications describing metagenomics studies in the European Nucleotide Archive (ENA) and MGnify. Using this framework, a new metagenomics annotations pipeline was developed and integrated into Europe PMC to regularly enrich up-to-date ENA and MGnify metagenomics studies with metadata extracted from research articles. These metadata are now available for researchers to explore and retrieve on the MGnify and Europe PMC websites, as well as through the Europe PMC Annotations API.
