Exploring variation across biomedical subdomains

Previous research has demonstrated the importance of handling differences between domains such as "newswire" and "biomedicine" when porting NLP systems from one domain to another. In this paper we identify the related issue of subdomain variation, i.e., differences between subsets of a domain that might be expected to behave homogeneously. Using a large corpus of research articles, we explore how subdomains of biomedicine vary across a variety of linguistic dimensions and discover that there is rich variation. We conclude that an awareness of such variation is necessary when deploying NLP systems for use in single or multiple subdomains.

[1]  K. Bretonnel Cohen,et al.  The textual characteristics of traditional and Open Access scientific journals are similar , 2008, BMC Bioinformatics.

[2]  Michael Collins,et al.  A New Statistical Parser Based on Bigram Lexical Dependencies , 1996, ACL.

[3]  Jun'ichi Tsujii,et al.  Adapting a Probabilistic Disambiguation Model of an HPSG Parser to a New Domain , 2005, IJCNLP.

[4]  Daniel Jurafsky,et al.  How Verb Subcategorization Frequencies Are Affected By Corpus Choice , 1998, COLING.

[5]  Yang Jin,et al.  Automated recognition of malignancy mentions in biomedical literature , 2006, BMC Bioinformatics.

[6]  Paul Rayson,et al.  Comparing Corpora using Frequency Profiling , 2000, Proceedings of the workshop on Comparing corpora -.

[7]  Jun'ichi Tsujii,et al.  GENIA corpus - a semantically annotated corpus for bio-textmining , 2003, ISMB.

[8]  Ted Briscoe,et al.  A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora , 2007, ACL.

[9]  Michael Collins,et al.  Head-Driven Statistical Models for Natural Language Parsing , 2003, CL.

[10]  Douglas Biber,et al.  Variation across speech and writing: Methodology , 1988 .

[11]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[12]  Stephen Clark,et al.  Porting a lexicalized-grammar parser to the biomedical domain , 2009, J. Biomed. Informatics.

[13]  Treebank Penn,et al.  Linguistic Data Consortium , 1999 .

[14]  Daniel Marcu,et al.  Domain Adaptation for Statistical Classifiers , 2006, J. Artif. Intell. Res..

[15]  Nigel Collier,et al.  The Choice of Features for Classification of Verbs in Biomedical Texts , 2008, COLING.

[16]  Johan Bos,et al.  Linguistically Motivated Large-Scale NLP with C&C and Boxer , 2007, ACL.

[17]  Seth Kulick,et al.  Integrated Annotation for Biomedical Information Extraction , 2004, HLT-NAACL 2004.

[18]  Barbara Rosario,et al.  Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy , 2001, EMNLP.

[19]  Jin-Dong Kim,et al.  Exploring Domain Differences for the Design of a Pronoun Resolution System for Biomedical Text , 2008, COLING.

[20]  Sabine Bergler,et al.  Postnominal Prepositional Phrase Attachment in Proteomics , 2006, BioNLP@NAACL-HLT.

[21]  Carol Friedman,et al.  Two biomedical sublanguages: a description based on the theories of Zellig Harris , 2002, J. Biomed. Informatics.

[22]  Douglas Biber,et al.  Challenging stereotypes about academic writing: Complexity, elaboration, explicitness , 2010 .

[23]  Martha Palmer,et al.  Nominalization and Alternations in Biomedical Language , 2008, PloS one.

[24]  Shalom Lappin,et al.  An Algorithm for Pronominal Anaphora Resolution , 1994, CL.