Analysing Entity Type Variation across Biomedical Subdomains

Previous studies have shown that biomedical subdomains vary in their lexical, syntactic, semantic and discourse structure. Recognising these differences matters because biomedical natural language processing tools, such as named entity recognisers, that work well on some subdomains may not work as well on others. In this paper, we investigate the pairwise similarity (or dissimilarity) amongst twenty selected biomedical subdomains at the level of named entity types. We evaluate the contribution of these types to the subdomain classification task by computing the chi-squared statistic over their distributions. We then build a binary classifier for each possible pair of subdomains; the results indicate which subdomains are highly different from, or similar to, the others. The findings may be useful to those building or using named entity recognisers, both in determining which types of named entities need to be taken into consideration and in adapting existing tools.
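
As a rough illustration of the chi-squared step described above, the sketch below scores each named entity type by how unevenly it is distributed between two subdomains, and repeats this for every pair of subdomains. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the subdomain names, entity type names and counts are hypothetical, the helper functions are invented for this example, and a per-type 2x2 Pearson chi-squared test is assumed as one plausible reading of "the chi-squared statistic over their distributions".

```python
from itertools import combinations


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the statistic is undefined; treat as 0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def score_entity_types(counts_x, counts_y):
    """Score each entity type by how unevenly it is spread over two subdomains.

    counts_x and counts_y map entity type -> number of mentions in each subdomain.
    """
    total_x = sum(counts_x.values())
    total_y = sum(counts_y.values())
    scores = {}
    for entity_type in sorted(set(counts_x) | set(counts_y)):
        in_x = counts_x.get(entity_type, 0)
        in_y = counts_y.get(entity_type, 0)
        # Rows: this type vs. all other types; columns: subdomain X vs. Y.
        scores[entity_type] = chi_squared_2x2(
            in_x, in_y, total_x - in_x, total_y - in_y
        )
    return scores


# Hypothetical mention counts per entity type (not data from the paper).
counts = {
    "subdomain_A": {"Gene": 80, "Chemical": 10, "Disease": 10},
    "subdomain_B": {"Gene": 20, "Chemical": 70, "Disease": 10},
    "subdomain_C": {"Gene": 75, "Chemical": 15, "Disease": 10},
}

# One comparison per unordered pair of subdomains.
pairs = list(combinations(sorted(counts), 2))
pair_scores = {
    (x, y): score_entity_types(counts[x], counts[y]) for x, y in pairs
}

for (x, y), scores in pair_scores.items():
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    print(x, y, ranked)
```

A high score suggests that an entity type helps to separate the two subdomains, while a score near zero suggests it is distributed similarly in both. With twenty subdomains, the same pairwise loop would cover 190 pairs, one per binary classifier.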
