Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes

Background: Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing because of its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitively expensive, automated tools capable of accurately extracting these associations from biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward.

Results: In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning to classify human protein-phenotype co-mentions with the help of unlabeled data. The framework can incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes alongside a small labeled dataset to improve overall performance. We develop PPPredSS, a prototype of the proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach achieves new state-of-the-art performance in classifying human protein-phenotype co-mentions, outperforming its supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists.

Conclusions: This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
