UPCLASS: a deep learning-based classifier for UniProtKB entry publications

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized into categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing the computationally mapped bibliography in UniProt, we investigate a Convolutional Neural Network (CNN) model that classifies publications with accession annotations according to UniProtKB categories. The main challenge in categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins, and can thus be associated with different category sets depending on the evidence it provides for each protein. We propose a model that divides the document into parts that contain evidence for the protein annotation and parts that do not. We then use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved an F1-score of 0.72, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant share of the publications, and to help curators decide whether a publication is relevant for further curation for a protein accession.
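The document-splitting step can be illustrated with a minimal sketch. It is not the method used in this work: it assumes, purely for illustration, that a sentence "contains evidence" for an accession when it mentions one of the protein's names, and the helper names `split_by_mention` and `build_feature_sets`, the example text, and the name lists are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def split_by_mention(text: str, mentions: List[str]) -> Tuple[List[str], List[str]]:
    """Split text into sentences that mention the protein and sentences that do not.

    Assumption: a name match stands in for "contains evidence"; the paper's
    actual criterion is not specified in the abstract.
    """
    # Naive sentence splitter: break on whitespace following ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    patterns = [
        re.compile(r"\b" + re.escape(m) + r"\b", re.IGNORECASE) for m in mentions
    ]
    with_mention, without_mention = [], []
    for sentence in sentences:
        if any(p.search(sentence) for p in patterns):
            with_mention.append(sentence)
        else:
            without_mention.append(sentence)
    return with_mention, without_mention


def build_feature_sets(
    text: str, accession_mentions: Dict[str, List[str]]
) -> Dict[str, Dict[str, List[str]]]:
    """Build one (with_evidence, without_evidence) pair of parts per accession."""
    feature_sets = {}
    for accession, mentions in accession_mentions.items():
        positive, negative = split_by_mention(text, mentions)
        feature_sets[accession] = {
            "with_evidence": positive,
            "without_evidence": negative,
        }
    return feature_sets


# Hypothetical publication text annotated with two accessions.
doc = (
    "BRCA1 interacts with BARD1. "
    "Expression was measured in liver. "
    "TP53 is a tumor suppressor."
)
accession_mentions = {"P38398": ["BRCA1"], "P04637": ["TP53", "p53"]}
sets = build_feature_sets(doc, accession_mentions)
```

The same publication yields a different pair of parts for each accession, which is what allows one document to receive different category sets per protein. In a full pipeline, each part would be vectorized separately and passed to its own input branch of the network.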
