Exploring subdomain variation in biomedical language

Background
Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level, such as between newswire and biomedical text, is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation.

Results
Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains.

Conclusions
We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.
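
The idea of comparing subdomains by their linguistic features and then clustering them can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the four subdomain names, the toy token lists, the use of Jensen-Shannon divergence over unigram distributions as the distance, and the simple threshold-based single-link grouping. The abstract does not state which features, distance or clustering algorithm were actually used.

```python
import math
from collections import Counter
from itertools import combinations


def distribution(tokens):
    """Relative frequency of each token in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two distributions."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture probability, positive for every word seen
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def threshold_clusters(names, distances, threshold):
    """Single-link grouping: merge any two items closer than the threshold."""
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for (a, b), d in distances.items():
        if d < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy "corpora"; a real study would use full-text articles.
corpora = {
    "genetics": "gene expression protein gene sequence mutation".split(),
    "molecular_biology": "protein gene binding expression protein cell".split(),
    "cardiology": "patient heart treatment patient risk trial".split(),
    "nursing": "patient care treatment nurse patient care".split(),
}

distributions = {name: distribution(tokens) for name, tokens in corpora.items()}
distances = {
    (a, b): js_divergence(distributions[a], distributions[b])
    for a, b in combinations(sorted(distributions), 2)
}
clusters = threshold_clusters(sorted(corpora), distances, threshold=0.75)
print(clusters)
```

On this toy data the laboratory-style and clinical-style subdomains share no vocabulary across groups, so they fall into separate clusters. With real text the distances would be much less clear-cut, and the threshold (an arbitrary choice here) would need to be chosen with more care.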
