The University of Maryland Senseval-3 system descriptions

For SENSEVAL-3, the University of Maryland (UMD) team focused on two primary issues: the portability of sense disambiguation across languages, and the exploitation of real-world bilingual text as a resource for unsupervised sense tagging. We validated the portability of our supervised disambiguation approach by applying it in seven tasks (English, Basque, Catalan, Chinese, Romanian, Spanish, and “multilingual” lexical samples), and we experimented with a new unsupervised algorithm for sense modeling using parallel corpora.

1 Supervised Sense Tagging for Lexical Samples

1.1 Tagging Framework

For the English, Basque, Catalan, Chinese, Romanian, Spanish, and “multilingual” lexical samples, we employed the UMD-SST system developed for SENSEVAL-2 (Cabezas et al., 2001); we refer the reader to that paper for a detailed system description. Briefly, UMD-SST takes a supervised learning approach, treating each word in a task’s vocabulary as an independent problem of classification into that word’s sense inventory. Each training and test item is represented as a weighted feature vector, with dimensions corresponding to properties of the context.

As in SENSEVAL-2, our system supported the following kinds of features:

Local context. For each i = 1, 2, and 3, and for each word w in the vocabulary, there is a feature representing the presence of word w at a distance of i words to the left of the word being disambiguated; there is a corresponding set of features for the local context to the right of the word.

Wide context. Each word in the training set vocabulary has a corresponding feature indicating its presence. For SENSEVAL-3, wide context features were taken from the entire training or test instance. In other settings, one might make further distinctions, e.g. between words in the same paragraph and words in the document.

We also experimented with the following additional kinds of features for English:

Grammatical context.
We use a syntactic dependency parser (Lin, 1998) to produce, for each word to be disambiguated, features identifying relevant syntactic relationships in the sentence where it occurs. For example, in the sentence “The U.S. government announced a new visa waiver policy”, the word “government” would have syntactic features like DET:THE, MOD:U.S., and SUBJ-OF:ANNOUNCED.

Expanded context. In information retrieval, we and other researchers have found that it can be useful to expand the representation of a document to include informative words from similar documents (Levow et al., 2001). In a similar spirit, we create a set of expanded-context features by (a) treating the WSD context as a bag of words, (b) issuing it as a query to a standard information retrieval system that has indexed a large collection of documents, and (c) including the non-stopword vocabulary of the top documents returned. So, for example, in a context containing the sentence “The U.S. government announced a new visa waiver policy”, the query might retrieve news articles like “US to Extend Fingerprinting to Europeans, Japanese” (Bloomberg.com, April 2, 2004), leading to the addition of features like EXT:EUROPEAN, EXT:JAPANESE, EXT:FINGERPRINTING, EXT:VISITORS, EXT:TOURISM, and so forth.

SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics.
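
To make the feature definitions of Section 1.1 concrete, the following is a minimal Python sketch of how local-context, wide-context, and expanded-context features might be extracted. It is an illustration under stated assumptions, not the UMD-SST implementation: the function names, the L/R/WIDE feature-name prefixes, the binary weight of 1.0, the exclusion of the target word from the wide context, and the `retrieve` callable standing in for an information retrieval system are all hypothetical. Only the EXT: prefix and the upper-case feature names follow the examples in the text.

```python
def local_context_features(tokens, target_index, max_distance=3):
    """Position-specific features for words within max_distance of the target."""
    features = {}
    for i in range(1, max_distance + 1):
        left = target_index - i
        right = target_index + i
        if left >= 0:
            # Word found i positions to the left of the target.
            features[f"L{i}:{tokens[left].upper()}"] = 1.0
        if right < len(tokens):
            # Word found i positions to the right of the target.
            features[f"R{i}:{tokens[right].upper()}"] = 1.0
    return features


def wide_context_features(tokens, target_index, vocabulary):
    """Presence features for vocabulary words anywhere in the instance."""
    features = {}
    for position, token in enumerate(tokens):
        if position == target_index:
            continue  # assumption: the target word itself is not a feature
        word = token.lower()
        if word in vocabulary:
            features[f"WIDE:{word.upper()}"] = 1.0
    return features


def expanded_context_features(tokens, retrieve, stopwords, top_k=5):
    """Features drawn from documents retrieved for the context, steps (a)-(c)."""
    query = [token.lower() for token in tokens]   # (a) context as a bag of words
    documents = retrieve(query, top_k)            # (b) hypothetical IR call
    features = {}
    for document_tokens in documents:             # (c) non-stopword vocabulary
        for token in document_tokens:
            word = token.lower()
            if word not in stopwords:
                features[f"EXT:{word.upper()}"] = 1.0
    return features


if __name__ == "__main__":
    sentence = "The U.S. government announced a new visa waiver policy".split()
    target = sentence.index("government")
    print(local_context_features(sentence, target))
    print(wide_context_features(sentence, target, {"visa", "policy"}))
```

Here `retrieve(query, top_k)` is assumed to return a list of tokenized documents; any retrieval backend could be substituted. The weights are binary only for simplicity, since the text states that feature vectors are weighted without giving the weighting scheme.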