Statistical parsing of varieties of clinical Finnish

OBJECTIVES: In this paper, we study the development and domain adaptation of statistical syntactic parsers for three different clinical domains in Finnish.

METHODS AND MATERIALS: The materials include text from daily nursing notes written by nurses in an intensive care unit, physicians' notes from cardiology patients' health records, and daily nursing notes from cardiology patients' health records. Parsing is performed with the statistical parser of Bohnet (http://code.google.com/p/mate-tools/, accessed: 22 November 2013).

RESULTS: A parser trained only on general language performs poorly in all clinical subdomains, with the labelled attachment score (LAS) ranging from 59.4% to 71.4%, whereas domain data combined with general language gives better results, with the LAS varying between 67.2% and 81.7%. However, even a small amount of clinical domain data quickly outperforms this, and clinical data from other domains is also more beneficial (LAS 71.3-80.0%) than general language alone. The best results (LAS 77.4-84.6%) are achieved by training on the combination of all the clinical treebanks.

CONCLUSIONS: To develop a good syntactic parser for clinical language variants, a general language resource is not mandatory, whereas data from clinical fields is. Beyond data from the exact same clinical domain, data from other clinical domains is also useful.
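For readers unfamiliar with the metric, the labelled attachment score reported above is conventionally the share of tokens whose predicted head and dependency label both match the gold standard. The following is a minimal sketch of that computation, not the evaluation code used in the study; the function name, the (head, label) tuple layout, and the example labels are illustrative assumptions, and details such as the treatment of punctuation vary between evaluation setups.

```python
from typing import List, Tuple

# One token's analysis: (index of its head token, dependency label).
# This layout is an assumption made for illustration only.
Arc = Tuple[int, str]


def labelled_attachment_score(gold: List[Arc], predicted: List[Arc]) -> float:
    """Return the fraction of tokens whose head AND label both match gold."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")
    # A token counts as correct only if the whole (head, label) pair matches.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical four-token sentence; labels are made up for the example.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "nommod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nommod"), (2, "nommod")]

las = labelled_attachment_score(gold, pred)
print(f"LAS = {las:.1%}")
```

In this toy sentence, the third token has the right head but the wrong label and the fourth has the right label but the wrong head, so neither counts towards the score.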
