Towards Semantic Role Labeling & IE in the Medical Literature

INTRODUCTION In this work, we introduce the concept of semantic role labeling to the medical domain. We report first results of porting and adapting an existing resource, Propbank, to the medical field. Propbank is an adjunct to the Penn Treebank that provides semantic annotation of predicates and the roles played by their arguments. The main aim of this work is to assess the applicability of the Propbank frame files to predicates typically encountered in the medical literature.

METHODS We analyzed a target corpus of 610,100 abstracts, selected by searching for the publication type "case reports". From this target corpus, we randomly selected 10,000 sample abstracts to estimate the predicate distribution, and matched the predicates from this sample to the predicates in Propbank.

RESULTS Of the 1998 unique verbs in our sample, 76% were represented in Propbank. This included the 40 most frequent verbs, which accounted for 49% of all predicate instances in our sample and whose usage, in a study of representative sentences, matched the Propbank usage. We propose extensions to Propbank to handle medical predicates that it does not adequately cover.

CONCLUSION We believe that semantic role labeling using Propbank is a valid approach to capturing predicate relations in the medical literature.
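The two coverage figures reported above (the share of unique verbs found in Propbank, and the share of predicate instances accounted for by the most frequent verbs) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `coverage`, the toy verb counts, and the lemma set standing in for the Propbank frame files are all hypothetical, and the sketch assumes verbs have already been tagged and lemmatized.

```python
from collections import Counter

def coverage(verb_counts, frame_lemmas, top_k):
    """Return (type coverage, token share of top-k verbs, top-k all covered).

    verb_counts  : Counter mapping verb lemma -> number of instances in the sample
    frame_lemmas : set of verb lemmas that have a frame file (assumed input)
    top_k        : how many of the most frequent verbs to inspect
    """
    total_types = len(verb_counts)
    total_tokens = sum(verb_counts.values())

    # Fraction of unique verbs that appear in the frame inventory.
    covered_types = sum(1 for verb in verb_counts if verb in frame_lemmas)

    # Fraction of all predicate instances contributed by the top-k verbs.
    top_verbs = [verb for verb, _ in verb_counts.most_common(top_k)]
    top_tokens = sum(verb_counts[verb] for verb in top_verbs)

    return (
        covered_types / total_types,
        top_tokens / total_tokens,
        all(verb in frame_lemmas for verb in top_verbs),
    )

# Hypothetical toy data, not taken from the paper or from Propbank.
sample_counts = Counter(
    {"present": 40, "show": 30, "reveal": 15, "treat": 10,
     "intubate": 3, "extubate": 2}
)
toy_frame_lemmas = {"present", "show", "reveal", "treat"}

type_cov, top_share, top_ok = coverage(sample_counts, toy_frame_lemmas, top_k=2)
print(f"type coverage: {type_cov:.2f}, top-2 token share: {top_share:.2f}, "
      f"top-2 all covered: {top_ok}")
```

Separating type coverage from token share mirrors the distinction in the results: a modest number of frequent verbs can account for a large share of predicate instances even when many rare verbs are missing from the inventory.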
