Acquisition and evaluation of verb subcategorization resources for biomedicine

BACKGROUND Biomedical natural language processing (NLP) applications that have access to detailed resources about the linguistic characteristics of biomedical language demonstrate improved performance on tasks such as relation extraction and syntactic or semantic parsing. Such applications are important for transforming the growing body of unstructured information buried in the biomedical literature into structured, actionable information. In this paper, we address the creation of linguistic resources that capture how individual biomedical verbs behave. We specifically consider verb subcategorization, or the tendency of verbs to "select" co-occurrence with particular phrase types, which influences the interpretation of verbs and the identification of verbal arguments in context. Only a limited number of biomedical resources currently contain information about subcategorization frames (SCFs), and these are the result of either labor-intensive manual collation or automatic methods that use tools adapted to a single biomedical subdomain. Either method may result in resources that lack coverage. Moreover, the quality of existing verb SCF resources for biomedicine is unknown, because no gold standards have been available for evaluation.

RESULTS This paper presents three new resources related to verb subcategorization frames in biomedicine, and four experiments making use of the new resources. We present the first biomedical SCF gold standards, capturing two different but widely used definitions of subcategorization, and a new SCF lexicon, BioCat, covering a large number of biomedical subdomains. We evaluate the SCF acquisition methodologies for BioCat with respect to the gold standards, and compare the results with the accuracy of the only previously existing automatically acquired SCF lexicon for biomedicine, the BioLexicon. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. Finally, we explore the definition of subcategorization using these resources and its implications for biomedical NLP. All resources are made publicly available.

CONCLUSION The SCF resources we have evaluated still show considerably lower accuracy than that reported for general English lexicons, demonstrating the need for domain- and subdomain-specific SCF acquisition tools for biomedicine. Our new gold standards reveal major differences between annotations produced under the two definitions. Moreover, evaluation of BioCat yields major differences in accuracy depending on the gold standard, demonstrating that the definition of subcategorization adopted will have a direct impact on perceived system accuracy for specific tasks.
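
The contrast between a higher-precision lexicon and a higher-coverage lexicon can be illustrated with a minimal sketch of type-level precision and recall against a gold standard. This is not the evaluation procedure used in the paper: the function name, the frame labels (`NP`, `NP_PP`, `SCOMP`), and the toy lexicons below are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

# A lexicon maps each verb to the set of SCF labels recorded for it.
# The labels used here are hypothetical placeholders, not a real SCF inventory.
Lexicon = Dict[str, Set[str]]


def evaluate_scf_lexicon(acquired: Lexicon, gold: Lexicon) -> Tuple[float, float, float]:
    """Return type-level (precision, recall, F1) over the verbs in the gold standard."""
    tp = fp = fn = 0
    for verb, gold_frames in gold.items():
        predicted = acquired.get(verb, set())
        tp += len(predicted & gold_frames)   # frames correctly proposed
        fp += len(predicted - gold_frames)   # frames proposed but not in gold
        fn += len(gold_frames - predicted)   # gold frames that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (assumed for illustration only).
gold = {"activate": {"NP", "NP_PP"}, "bind": {"NP", "PP"}}
high_precision = {"activate": {"NP"}, "bind": {"NP"}}
high_coverage = {
    "activate": {"NP", "NP_PP", "SCOMP"},
    "bind": {"NP", "PP", "NP_PP"},
}

for name, lexicon in [("high_precision", high_precision),
                      ("high_coverage", high_coverage)]:
    p, r, f = evaluate_scf_lexicon(lexicon, gold)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy setting the first lexicon proposes only correct frames but misses half of the gold frames, while the second recovers every gold frame at the cost of some spurious ones. Swapping in a gold standard built under a different definition of subcategorization would change both scores, which is the effect the conclusion describes.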
