What’s in a Name? Entity Type Variation across Two Biomedical Subdomains

The sublanguages used in different biomedical subdomains vary at the lexical, syntactic, semantic and discourse levels. It is important to recognise these differences, because biomedical tools that work well on some subdomains may not work as well on others. We report here on the semantic variation between the sublanguages of two biomedical subdomains, cell biology and pharmacology, at the level of named entity information. By building a classifier that uses ratios of named entities as features, we show that named entity information can discriminate between documents from the two subdomains: the classifier assigns documents to their subdomain with an F-score of 91.1%.
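The idea of using named entity ratios as document features can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the entity type inventory (`ENTITY_TYPES`), the toy training documents, and the nearest-centroid classifier, which stands in for whatever learner the authors actually used.

```python
import math
from collections import Counter

# Hypothetical entity type inventory; the paper's actual types may differ.
ENTITY_TYPES = ["Protein", "Cell", "Drug", "Disease"]


def entity_ratios(entity_labels):
    """Turn a document's list of entity type labels into a ratio vector."""
    counts = Counter(entity_labels)
    total = sum(counts[t] for t in ENTITY_TYPES)
    if total == 0:
        # A document with no recognised entities gets an all-zero vector.
        return [0.0] * len(ENTITY_TYPES)
    return [counts[t] / total for t in ENTITY_TYPES]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(labelled_docs):
    """Compute one mean ratio vector per subdomain (nearest-centroid model)."""
    by_subdomain = {}
    for entity_labels, subdomain in labelled_docs:
        by_subdomain.setdefault(subdomain, []).append(entity_ratios(entity_labels))
    return {sub: centroid(vecs) for sub, vecs in by_subdomain.items()}


def classify(model, entity_labels):
    """Return the subdomain whose centroid is closest in Euclidean distance."""
    x = entity_ratios(entity_labels)
    return min(model, key=lambda sub: math.dist(x, model[sub]))


# Toy, made-up training data: each document is reduced to the list of
# entity types a named entity recogniser found in it.
TRAIN = [
    (["Protein", "Protein", "Cell", "Protein"], "cell_biology"),
    (["Cell", "Protein", "Cell"], "cell_biology"),
    (["Drug", "Drug", "Disease", "Protein"], "pharmacology"),
    (["Drug", "Disease", "Drug"], "pharmacology"),
]

model = train(TRAIN)
print(classify(model, ["Drug", "Drug", "Protein"]))
print(classify(model, ["Protein", "Cell", "Cell"]))
```

Ratios are used instead of raw counts so that long and short documents are comparable. A real experiment would need an actual named entity recogniser, a much larger corpus, and held-out evaluation to report an F-score.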
