Optimizing the Dimensionality of Clinical Term Spaces for Improved Diagnosis Coding Support

In natural language processing, dimensionality reduction is a common technique for reducing complexity while also addressing the sparseness of language data. It is also used to capture latent structure in text, such as the underlying semantics. Dimensionality reduction is a key component of the word space model, not least in random indexing, where the dimensionality is a predefined model parameter. In this paper, we demonstrate the importance of dimensionality optimization and discuss correlations between dimensionality and vocabulary size. This is particularly important in the clinical domain, where the level of noise in the text leads to a large vocabulary; optimizing the dimensionality may also mitigate the effect of exploding vocabulary sizes when multiword terms are modeled as single tokens. A system that automatically assigns diagnosis codes to patient record entries is shown to improve by up to 18 percentage points when the dimensionality is manually optimized.
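
To make the role of the dimensionality parameter concrete, the following is a minimal sketch of random indexing in plain Python. It is not the authors' implementation: the function names, the `dim`, `nonzeros` and `window` parameters, and the toy sentences are all assumptions chosen for illustration. Each term receives a sparse random index vector of length `dim`, and a term's context vector is the sum of the index vectors of its neighbours.

```python
import random


def make_index_vector(dim, nonzeros, rng):
    """Sparse ternary index vector stored as {position: +1 or -1}."""
    positions = rng.sample(range(dim), nonzeros)
    return {pos: (1 if i % 2 == 0 else -1) for i, pos in enumerate(positions)}


def random_indexing(sentences, dim=1000, nonzeros=8, window=2, seed=0):
    """Build context vectors of length `dim` (the predefined dimensionality).

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab = sorted({tok for sent in sentences for tok in sent})
    index = {tok: make_index_vector(dim, nonzeros, rng) for tok in vocab}
    context = {tok: [0] * dim for tok in vocab}
    for sent in sentences:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # Add the neighbour's sparse index vector to this term's context.
                for pos, sign in index[sent[j]].items():
                    context[tok][pos] += sign
    return context


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy data; real clinical text would be far larger and noisier.
sentences = [["patient", "has", "fever"], ["patient", "has", "cough"]]
space = random_indexing(sentences, dim=50, nonzeros=4, window=1)
print(cosine(space["fever"], space["cough"]))
```

With a window of one, "fever" and "cough" share exactly the same single neighbour ("has"), so their context vectors coincide. Rerunning the sketch with different `dim` values on a realistic corpus is one way to explore how the dimensionality interacts with vocabulary size, although the settings that work best would have to be determined empirically.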
