A statistical method for language-independent representation of the topical content of text segments

Where there are texts in more than one language, it would be desirable if users could give queries or examples in the language in which they are most competent and obtain relevant text passages in any language. We have developed and tested a prototype system that makes this possible. The system is based entirely on a statistical technique that requires no humanly constructed dictionary, thesaurus, or term bank. The language-independent representation of text has two steps. In the first, done just once for a subject area, a sample collection of parallel texts (paragraph-by-paragraph translations in two or more languages) is analyzed by a mathematical technique called Singular Value Decomposition. Each word in the sample is assigned a vector value determined by the total pattern of usage of all the words in all the sample paragraphs. In the second step, a new document or query in any of the original languages is assigned a vector value that is an average of the values of the words it contains. Tests on a French-English corpus showed that the method works well.

Key-words: Interlingua, IR, information retrieval, statistical techniques, LSI, SVD, semantics, translation, multilingual filters
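
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the prototype system itself: the toy corpus, the function names (`train_word_vectors`, `embed`, `cosine`), the choice of raw word counts, and the use of unscaled rows of the truncated left singular matrix as word vectors are all assumptions made for illustration. Each training item is assumed to be an English paragraph concatenated with its French translation.

```python
import numpy as np

# Hypothetical toy corpus: each item is an English paragraph joined with
# its French translation, so translated words share the same contexts.
PARALLEL_PARAGRAPHS = [
    "dog cat pet chien chat animal",
    "dog pet chien animal",
    "cat pet chat animal",
    "bank money loan banque argent emprunt",
    "bank money banque argent",
    "money loan argent emprunt",
    "bank loan banque emprunt",
]


def train_word_vectors(paragraphs, k):
    """Step 1: SVD of the word-by-paragraph count matrix, keeping k dimensions."""
    vocab = sorted({w for p in paragraphs for w in p.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(paragraphs)))
    for j, p in enumerate(paragraphs):
        for w in p.split():
            counts[index[w], j] += 1.0
    # Singular values come back in descending order.
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    # Assumption: a word's vector is its row of U truncated to k columns.
    return {w: u[index[w], :k] for w in vocab}


def embed(text, word_vectors):
    """Step 2: average the vectors of the known words in a query or document."""
    known = [word_vectors[w] for w in text.split() if w in word_vectors]
    if not known:
        raise ValueError("text contains no words seen in training")
    return np.mean(known, axis=0)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


word_vectors = train_word_vectors(PARALLEL_PARAGRAPHS, k=2)
query = embed("dog", word_vectors)                     # English query
french_pets = embed("chien chat", word_vectors)        # French text, same topic
french_finance = embed("banque argent", word_vectors)  # French text, other topic
print(cosine(query, french_pets), cosine(query, french_finance))
```

In this toy corpus the two topics share no vocabulary, so the English query should score close to 1 against the French text on the same topic and close to 0 against the French text on the other topic. A real collection would have overlapping vocabulary and would need many more dimensions than `k=2`.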