A pseudoword is a composite of two or more words chosen at random; each occurrence of the original words within a text is replaced by their conflation. Pseudowords are a useful mechanism for evaluating the impact of word sense ambiguity in many NLP applications. However, the standard method for constructing pseudowords has some drawbacks. Because the constituent words are chosen at random, the contexts that surround pseudowords do not necessarily reflect the contexts in which real ambiguous words occur. This in turn leads to an optimistic upper bound on algorithm performance. To address these drawbacks, we propose the use of lexical categories to create more realistic pseudowords, and evaluate the results of different variations of this idea against the standard approach.
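The standard construction described above can be illustrated with a minimal sketch. It assumes a tokenized text and a known vocabulary; the function names (`choose_constituents`, `conflate`), the underscore separator, and the example sentence are hypothetical choices made for illustration, not taken from the paper. The proposed lexical-category variant is not shown here.

```python
import random


def choose_constituents(vocabulary, k=2, seed=0):
    """Pick k distinct words at random, as in the standard construction."""
    rng = random.Random(seed)
    # Sort first so the sample is reproducible for a given seed.
    return rng.sample(sorted(vocabulary), k)


def conflate(tokens, constituents, sep="_"):
    """Replace every constituent word with the pseudoword.

    Returns the rewritten tokens and a list of (position, original word)
    pairs, which serve as gold labels for a disambiguation algorithm.
    """
    pseudoword = sep.join(constituents)  # assumed naming convention
    members = set(constituents)
    new_tokens, gold = [], []
    for i, tok in enumerate(tokens):
        if tok in members:
            new_tokens.append(pseudoword)
            gold.append((i, tok))
        else:
            new_tokens.append(tok)
    return new_tokens, gold


tokens = "the bank approved the loan while the door stayed open".split()
new_tokens, gold = conflate(tokens, ["bank", "door"])
print(" ".join(new_tokens))
print(gold)
```

Because the original word at each replaced position is known, the output of a disambiguation algorithm on the pseudoword can be scored without manual sense annotation.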