Corpus-based identification and refinement of semantic classes

Medical Language Processing (MLP), especially in specific domains, requires fine-grained semantic lexica. We examine whether robust natural language processing tools, applied to a representative corpus of a domain, help build and refine a semantic categorization. We test this hypothesis with ZELLIG, a corpus analysis tool. The first clusters we obtain are consistent with a model of the domain as found in the SNOMED nomenclature. They correspond to coarse-grained semantic categories, but they also isolate lexical idiosyncrasies of the clinical sublanguage. Moreover, they help categorize additional words.
