Paradigmatic Modifiability Statistics for the Extraction of Complex Multi-Word Terms

We propose a new method that separates domain-specific terminology from common, non-specific noun phrases. It is based on the observation that terminological multi-word groups show considerably less distributional variation than non-specific noun phrases. We define a measure of the observable paradigmatic modifiability of terms and test it on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using a community-wide curated biomedical terminology system as an evaluation gold standard, we show that our algorithm significantly outperforms a variety of standard term identification measures. We also provide empirical evidence that our methodology is essentially domain- and corpus-size-independent.
