Natural language processing in text mining for structural modeling of protein complexes

Background: Structural modeling of protein-protein interactions produces a large number of putative configurations of the protein complexes. Identifying the near-native models among them is a serious challenge. Publicly available results of biomedical research may provide constraints on the binding mode, which can be essential for docking. Our text-mining (TM) tool, which extracts binding-site residues from PubMed abstracts, was successfully applied to protein docking (Badal et al., PLoS Comput Biol, 2015; 11: e1004630). Still, many of the extracted residues were not relevant to docking.

Results: We present an extension of the TM tool that uses natural language processing (NLP) to analyze the context in which a residue is mentioned. The procedure was tested using generic and specialized dictionaries. The results showed that keyword dictionaries designed for identifying protein interactions are not adequate for TM prediction of the binding mode. However, our dictionary, designed to distinguish keywords relevant to protein binding sites, led to a considerable improvement in TM performance. We investigated the utility of several methods of context analysis based on dissection of the sentence parse trees. The machine learning-based NLP filtered the pool of mined residues significantly more efficiently than the rule-based NLP. Constraints generated by NLP were tested in docking of unbound proteins from the DOCKGROUND X-ray benchmark set 4. The output of the global low-resolution docking scan was post-processed, separately, by constraints from the basic TM, constraints re-ranked by NLP, and the reference constraints. The quality of a match was assessed by the interface root-mean-square deviation. The results showed a significant improvement of the docking output when using the constraints generated by the advanced TM with NLP.

Conclusions: The basic TM procedure for extracting protein-protein binding-site residues from PubMed abstracts was significantly advanced by deep parsing (NLP techniques for contextual analysis), which purges the initial pool of extracted residues. Benchmarking showed a substantial increase in the docking success rate based on the constraints generated by the advanced TM with NLP.
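
To make the idea of context-based filtering of mined residues concrete, the following is a minimal sketch and not the authors' implementation. It assumes residues are written as a three-letter code plus a position (for example "Arg45"), and it replaces the parse-tree and machine-learning analysis described above with a much simpler sentence-level keyword check. The names `BINDING_KEYWORDS`, `mine_residues`, and `filter_relevant`, as well as the keyword list itself, are hypothetical and chosen only for illustration.

```python
import re

# Three-letter amino-acid codes, used to recognise mentions such as "Arg45".
THREE_LETTER = (
    "Ala Arg Asn Asp Cys Gln Glu Gly His Ile "
    "Leu Lys Met Phe Pro Ser Thr Trp Tyr Val"
).split()

# A residue mention: three-letter code, optional hyphen or space, then a position.
# This naive, case-sensitive pattern will miss other notations (e.g. "R45").
RESIDUE_RE = re.compile(r"\b(" + "|".join(THREE_LETTER) + r")[- ]?(\d+)\b")

# Hypothetical binding-related keywords (illustrative only; NOT the dictionary
# used in the paper).
BINDING_KEYWORDS = {
    "bind", "binds", "binding", "interface",
    "interact", "interacts", "interaction", "contact", "contacts",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def mine_residues(text):
    """Return (code, position, relevant) for every residue mention in text.

    'relevant' is True when the sentence containing the mention also contains
    at least one keyword from BINDING_KEYWORDS.
    """
    mined = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        relevant = bool(words & BINDING_KEYWORDS)
        for match in RESIDUE_RE.finditer(sentence):
            mined.append((match.group(1), int(match.group(2)), relevant))
    return mined


def filter_relevant(mined):
    """Keep only residues whose sentence context mentions a binding keyword."""
    return [(code, pos) for code, pos, relevant in mined if relevant]


if __name__ == "__main__":
    abstract = (
        "Mutation of Arg45 abolished binding to the receptor. "
        "Cys102 forms a disulfide bond in the core."
    )
    print(filter_relevant(mine_residues(abstract)))
```

In this sketch, the first sentence of the example text contains the keyword "binding", so its residue is kept, while the residue in the second sentence is discarded. A sentence-level keyword match is far coarser than analysis of the sentence parse tree, so it should be read only as an illustration of purging the initial residue pool by context.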
