Evaluation of two dependency parsers on biomedical corpus targeted at protein-protein interactions

We present an evaluation of Link Grammar and Connexor Machinese Syntax, two major broad-coverage dependency parsers, on a custom hand-annotated corpus consisting of sentences regarding protein-protein interactions. In the evaluation, we apply the notion of an interaction subgraph, which is the subgraph of a dependency graph expressing a protein-protein interaction. We measure the performance of the parsers for recovery of individual dependencies, fully correct parses, and interaction subgraphs. For Link Grammar, an open system that can be inspected in detail, we further perform a comprehensive failure analysis, report specific causes of error, and suggest potential modifications to the grammar. We find that both parsers perform worse on biomedical English than previously reported on general English. While Connexor Machinese Syntax significantly outperforms Link Grammar, the failure analysis suggests specific ways in which the latter could be modified for better performance in the domain.
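The three evaluation levels named above (individual dependencies, fully correct parses, and interaction subgraphs) can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that a parse is represented as a set of unlabeled (head, dependent) token-index pairs, that a parse counts as fully correct when every gold dependency is recovered, and that an interaction subgraph counts as recovered when all of its dependencies appear in the parser output. The function names and the toy sentence are hypothetical.

```python
from typing import Set, Tuple

# Assumption: a dependency is an unlabeled (head_index, dependent_index) pair.
Dep = Tuple[int, int]


def dependency_recall(gold: Set[Dep], parsed: Set[Dep]) -> float:
    """Fraction of gold dependencies that the parser recovered."""
    if not gold:
        return 1.0
    return len(gold & parsed) / len(gold)


def is_fully_correct(gold: Set[Dep], parsed: Set[Dep]) -> bool:
    """True if every gold dependency appears in the parser output."""
    return gold <= parsed


def subgraph_recovered(subgraph: Set[Dep], parsed: Set[Dep]) -> bool:
    """True if every dependency of the interaction subgraph was recovered."""
    return subgraph <= parsed


# Hypothetical toy sentence: "ProteinA binds ProteinB in yeast"
# Token indices:               0        1     2        3  4
gold = {(1, 0), (1, 2), (1, 3), (3, 4)}

# Assumed interaction subgraph: the dependencies linking the two protein names.
interaction = {(1, 0), (1, 2)}

# Hypothetical parser output that attaches "in" to "ProteinB" instead of "binds".
parsed = {(1, 0), (1, 2), (2, 3), (3, 4)}

recall = dependency_recall(gold, parsed)
fully_correct = is_fully_correct(gold, parsed)
interaction_ok = subgraph_recovered(interaction, parsed)
```

In this toy case the parse contains an attachment error, so it is not fully correct, yet the interaction subgraph is still recovered. This is why a subgraph-level measure can differ from a whole-sentence measure.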
