Literature mining of host-pathogen interactions: comparing feature-based supervised learning and language-based approaches

MOTIVATION: In an infectious disease, the pathogen's strategy to enter the host organism and breach its immune defenses often involves interactions between host and pathogen proteins. Currently, the experimental data on host-pathogen interactions (HPIs) are scattered across multiple databases, which are often specialized to target a specific disease or host organism. An accurate and efficient method for the automated extraction of HPIs from biomedical literature is crucial for creating a unified repository of HPI data.

RESULTS: Here, we introduce and compare two new approaches that automatically detect whether the title or abstract of a PubMed publication contains HPI data and extract information about the organisms and proteins involved in the interaction. The first approach is a feature-based supervised learning method using support vector machines (SVMs). The SVM models are trained on features derived from individual sentences. These features include names of the host and pathogen organisms and the corresponding proteins or genes, keywords describing HPI-specific information, more general protein-protein interaction information, experimental methods and other statistical information. The second, language-based approach employs a link grammar parser combined with semantic patterns derived from the training examples. Both approaches were trained and tested on manually curated HPI data. When compared to a naïve approach based on an existing protein-protein interaction literature mining method, our approaches demonstrated higher accuracy and recall in the classification task. The feature-based approach was the more accurate of the two, achieving 66-73% accuracy, depending on the test protocol.
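As a rough illustration of the sentence-level features listed above, the following minimal sketch turns a sentence into a count vector with one entry per feature group. It is not the authors' implementation: the term lists, the tokenizer, the function name `sentence_features` and the example sentences are all hypothetical placeholders, and a real system would rely on curated organism, protein and keyword dictionaries.

```python
import re

# Hypothetical toy vocabularies, one per feature group named in the abstract.
# A real system would use curated lexicons instead of these small sets.
HOST_TERMS = {"human", "mouse", "host"}
PATHOGEN_TERMS = {"hiv-1", "yersinia", "salmonella", "virus"}
HPI_KEYWORDS = {"infects", "invades", "hijacks", "virulence"}
PPI_KEYWORDS = {"binds", "interacts", "phosphorylates", "complex"}
METHOD_TERMS = {"two-hybrid", "co-immunoprecipitation", "pull-down"}

# Lowercase alphanumeric tokens; internal hyphens are kept ("hiv-1", "two-hybrid").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(sentence):
    """Split a sentence into lowercase tokens."""
    return TOKEN_RE.findall(sentence.lower())


def sentence_features(sentence):
    """Return one count per feature group, plus the token count.

    Order: host terms, pathogen terms, HPI keywords, PPI keywords,
    experimental-method terms, sentence length in tokens.
    """
    tokens = tokenize(sentence)

    def count(vocabulary):
        return sum(1 for token in tokens if token in vocabulary)

    return [
        count(HOST_TERMS),
        count(PATHOGEN_TERMS),
        count(HPI_KEYWORDS),
        count(PPI_KEYWORDS),
        count(METHOD_TERMS),
        len(tokens),
    ]


if __name__ == "__main__":
    # Illustrative sentence, not taken from the curated HPI data.
    print(sentence_features("The HIV-1 protein Nef binds human CD4."))
```

Vectors of this kind, paired with manually curated labels, could then be passed to any SVM implementation (for example `sklearn.svm.SVC`, if scikit-learn is available) to train a sentence classifier.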
