ProPheno 1.0: An Online Dataset for Accelerating the Complete Characterization of the Human Protein-Phenotype Landscape in Biomedical Literature

Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. One of the best resources capturing protein-phenotype relationships is the biomedical literature. In this work, we introduce ProPheno, a comprehensive online dataset composed of human protein and phenotype mentions extracted from the complete corpora of Medline and PubMed Central Open Access. It also includes co-occurrences of protein-phenotype pairs within different spans of text, such as sentences and paragraphs. We use ProPheno to completely characterize the human protein-phenotype landscape in biomedical literature. ProPheno, the reported findings, and the gained insights have implications for (1) biocurators expediting their curation efforts, (2) researchers quickly finding relevant articles, and (3) text mining tool developers training their predictive models. The RESTful API of ProPheno is freely available at http://propheno.cs.montana.edu.
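To make the notion of co-occurrence within different text spans concrete, the following is a minimal Python sketch of counting protein-phenotype pairs at the sentence level and at the paragraph level. It is an illustration under stated assumptions, not the pipeline used to build ProPheno: the toy lexicons `PROTEINS` and `PHENOTYPES`, the dictionary-based matcher `find_mentions`, the naive splitter `split_sentences`, and the function `co_occurrences` are all hypothetical names introduced here, and paragraphs are assumed to be separated by blank lines.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons for illustration only; not ProPheno's dictionaries.
PROTEINS = {"BRCA1", "TP53"}
PHENOTYPES = {"breast carcinoma", "ataxia"}


def find_mentions(text, lexicon):
    """Return the lexicon terms found in text (case-insensitive, whole-word match)."""
    found = set()
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(term)
    return found


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def pairs_in(span):
    """All (protein, phenotype) pairs whose members both occur in the span."""
    return set(product(find_mentions(span, PROTEINS),
                       find_mentions(span, PHENOTYPES)))


def co_occurrences(document):
    """Count, per pair, the number of sentences and paragraphs containing it.

    Assumes paragraphs are separated by blank lines.
    Returns (sentence_counts, paragraph_counts) as Counters.
    """
    sentence_counts = Counter()
    paragraph_counts = Counter()
    for paragraph in document.split("\n\n"):
        if not paragraph.strip():
            continue
        paragraph_counts.update(pairs_in(paragraph))
        for sentence in split_sentences(paragraph):
            sentence_counts.update(pairs_in(sentence))
    return sentence_counts, paragraph_counts


if __name__ == "__main__":
    doc = (
        "BRCA1 mutations are linked to breast carcinoma. TP53 is also studied.\n\n"
        "TP53 was sequenced. Patients showed ataxia."
    )
    by_sentence, by_paragraph = co_occurrences(doc)
    print("sentence-level:", dict(by_sentence))
    print("paragraph-level:", dict(by_paragraph))
```

In this toy document, a pair such as TP53 and ataxia co-occurs only at the paragraph level, because the two mentions fall in different sentences. Wider spans therefore yield more candidate pairs, at the cost of looser evidence that the pair is actually related.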
