Exploring the Validity of Corpus-derived Measures of Semantic Similarity

Lexical co-occurrence counts from large corpora have been used to construct high-dimensional vector-space models of language. In this type of model, words are represented as vectors (or points) in a hyperspace, and distances between word vectors are generally considered to reflect semantic similarity. Two issues must be addressed if a vector-space model is to be used as a 'semantic' measuring device: reliability and validity. Do context vectors reliably measure what they are supposed to be measuring? Does 'semantic distance' correlate with other variables that predict variation? A simple, principled method for determining and improving the reliability of co-occurrence vectors is presented and tested in two corpus experiments. The validity question is addressed using psychological data from a semantic similarity rating task as the criterion measure.
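
As an illustration of the kind of model described above, the following minimal Python sketch builds co-occurrence vectors from a toy token sequence and compares two words. The window size, the toy corpus, the function names, and the use of cosine similarity are all assumptions made for illustration; the abstract does not specify the window, the weighting, or the distance measure actually used.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(tokens, window=2):
    """Count, for each word, the words occurring within `window` positions.

    The symmetric window and raw counts are illustrative assumptions.
    """
    vectors = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[target][tokens[j]] += 1
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real study would use a large corpus.
tokens = "the cat chased the mouse the dog chased the cat".split()
vecs = cooccurrence_vectors(tokens, window=2)
sim_cat_dog = cosine_similarity(vecs["cat"], vecs["dog"])
```

A higher cosine value corresponds to a smaller 'semantic distance' between the two words under these assumptions. With a corpus this small the values say nothing about meaning, which is the reliability concern the abstract raises.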