PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature

The MEDLINE database is an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge contained in these articles are the relations between human proteins and their phenotypes, which can remain hidden because of the exponential growth of publications. This has created a range of opportunities for developing computational methods that extract such biomedical relations from articles. To date, however, no method exists for the automated extraction of relations between human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database of all co-mentions of proteins and phenotypes. In this study, we present PPPred (Protein-Phenotype Predictor), a supervised machine learning approach for classifying the validity of a given sentence-level co-mention. Using a gold standard dataset developed in-house, we demonstrate that PPPred significantly outperforms several baseline methods. Together, the two steps of co-mention extraction and classification constitute a complete biomedical relation extraction pipeline for protein-phenotype relations.
