Increasing metadata coverage of SRA BioSample entries using deep learning–based named entity recognition

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information’s (NCBI’s) Sequence Read Archive (SRA) are missing metadata across several categories. To improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable recurrent neural network that predicts missing metadata via Named Entity Recognition (NER). The network was first trained to classify short text phrases into 11 metadata categories, achieving an overall accuracy of 85.2% and an area under the receiver operating characteristic curve (AUROC) of 0.977. We then applied this classifier to predict the same 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs. Lower accuracies and missing predictions for the other categories highlight multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation.

Availability: All the analyses, environments, and Jupyter notebooks pertaining to this manuscript are available on GitHub: https://github.com/cartercompbio/PredictMEE.
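
The step from a short-phrase classifier to predictions on a longer TITLE can be illustrated with a minimal sketch. It assumes, without confirmation from the text above, that candidate phrases are generated as word n-grams and that the best-scoring phrase is kept for each category. The names `ngrams`, `extract_metadata`, `toy_scorer` and `TOY_LEXICON` are hypothetical, and the lookup-based scorer is only a stand-in for the trained recurrent network, not the authors' model.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def ngrams(tokens: List[str], max_n: int) -> Iterator[str]:
    """Yield every contiguous word n-gram of length 1..max_n as a string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def extract_metadata(
    title: str,
    score_phrase: Callable[[str], Dict[str, float]],
    max_n: int = 3,
    threshold: float = 0.5,
) -> Dict[str, Tuple[str, float]]:
    """Keep, per category, the highest-scoring candidate phrase from a TITLE.

    `score_phrase` maps a short phrase to {category: score}; in practice this
    would be the trained phrase classifier (assumption: scores in [0, 1]).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for phrase in ngrams(title.split(), max_n):
        for category, score in score_phrase(phrase).items():
            current = best.get(category, ("", 0.0))[1]
            if score >= threshold and score > current:
                best[category] = (phrase, score)
    return best


# Hypothetical stand-in for the neural classifier: an exact-match lexicon.
TOY_LEXICON = {
    "homo sapiens": "Species",
    "breast cancer": "Condition/Disease",
    "c57bl/6": "Strain",
}


def toy_scorer(phrase: str) -> Dict[str, float]:
    """Return a score of 1.0 for the category of a known phrase, else nothing."""
    category = TOY_LEXICON.get(phrase.lower())
    return {category: 1.0} if category else {}


result = extract_metadata("RNA-seq of Homo sapiens breast cancer biopsy", toy_scorer)
print(result)
```

Categories for which no candidate phrase reaches the threshold are simply absent from the returned dictionary, which mirrors the "lack of predictions" case mentioned above.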
