Learning a Language-Independent Representation for Terms from a Partially Aligned Corpus

Cross-language latent semantic indexing is a method that learns useful language-independent vector representations of terms through a statistical analysis of document-aligned text. This is accomplished by taking a collection of, say, English paragraphs and their translations in Spanish and processing them by singular value decomposition to yield a high-dimensional vector representation for each term in the collection. These term vectors have the property that semantically similar terms have vectors with a high cosine similarity, regardless of their source language. In the present work, we extend this approach to the case in which English-Spanish translations are not available, but translations of the documents in both languages are available in a third "bridge" language, say, French. Thus, although no aligned English-Spanish documents are used, our method creates a representation in which English and Spanish terms can be compared. The resulting vector representation of terms can be useful in natural language applications such as cross-language information retrieval and machine translation.
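As an illustration of the directly aligned baseline described above, the following is a minimal sketch in Python with NumPy. It stacks English and Spanish term-by-document count matrices over the same aligned paragraphs, applies a truncated singular value decomposition, and compares terms by cosine. The vocabulary, the counts, the number of retained dimensions k, and the helper name cosine are all hypothetical choices made for this toy example; they are not taken from the paper, and the bridge-language extension is not shown.

```python
import numpy as np

# Hypothetical toy vocabulary: three English terms and their Spanish counterparts.
terms = ["dog", "cat", "house", "perro", "gato", "casa"]

# Hypothetical term-by-document counts over four aligned paragraphs.
# Column j of `english` and column j of `spanish` describe the same paragraph.
english = np.array([[2, 0, 0, 1],   # dog
                    [0, 1, 0, 1],   # cat
                    [0, 0, 3, 0]],  # house
                   dtype=float)
spanish = np.array([[1, 0, 0, 1],   # perro
                    [0, 2, 0, 1],   # gato
                    [0, 0, 2, 0]],  # casa
                   dtype=float)

# Stack both vocabularies so each aligned paragraph is one combined column.
A = np.vstack([english, spanish])

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular directions (k is an assumed setting).
k = 3
term_vectors = U[:, :k] * s[:k]  # one k-dimensional row per term


def cosine(a, b):
    """Cosine between the reduced vectors of two terms in `terms`."""
    va = term_vectors[terms.index(a)]
    vb = term_vectors[terms.index(b)]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))


for pair in [("dog", "perro"), ("dog", "gato"), ("dog", "casa")]:
    print(pair, round(cosine(*pair), 3))
```

In this toy setting, a term and its translation occur in the same aligned paragraphs, so their vectors point in nearly the same direction, while terms that never share a paragraph come out close to orthogonal. Real corpora would normally use weighted counts and many more dimensions.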