Paper information - Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

We present an approach to calculating the semantic similarity of documents written in the same or in different languages. Document contents are represented in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and similarity is then calculated as the distance between these representations. While EUROVOC is a carefully handcrafted knowledge structure, our procedure uses statistical techniques. The method was applied to a collection of 5990 English and Spanish parallel texts and evaluated by counting how often the translation of a given document was identified as its most similar document. The results were good, showing the feasibility and usefulness of the approach.
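The idea in the abstract can be illustrated with a minimal Python sketch. It assumes each document has already been mapped to a sparse vector of weighted thesaurus descriptors. The abstract does not name the distance measure, so cosine similarity is used here purely as an assumption. The descriptor keys, weights, and document ids are hypothetical and are not taken from EUROVOC or from the paper's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse descriptor-weight vectors (dicts)."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty representation is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the id of the candidate whose vector is closest to the query."""
    return max(candidates, key=lambda doc_id: cosine_similarity(query, candidates[doc_id]))


def top1_accuracy(sources, targets):
    """Fraction of source documents whose most similar target has the same id.

    Assumes (hypothetically) that a document and its translation share an id.
    """
    hits = sum(1 for doc_id, vec in sources.items() if most_similar(vec, targets) == doc_id)
    return hits / len(sources)


# Hypothetical descriptor weights; the keys stand in for language-independent
# thesaurus descriptors, so English and Spanish documents share one vocabulary.
english_docs = {
    "doc_1": {"desc_fisheries": 0.8, "desc_waters": 0.5, "desc_quota": 0.3},
    "doc_2": {"desc_monetary": 0.9, "desc_interest": 0.5},
}
spanish_docs = {
    "doc_1": {"desc_fisheries": 0.7, "desc_quota": 0.4, "desc_waters": 0.4},
    "doc_2": {"desc_monetary": 0.9, "desc_interest": 0.6},
    "doc_3": {"desc_fisheries": 0.2, "desc_environment": 0.9},
}

best = most_similar(english_docs["doc_1"], spanish_docs)
accuracy = top1_accuracy(english_docs, spanish_docs)
```

Because both languages are projected onto the same descriptor vocabulary, the comparison itself needs no translation step. The `top1_accuracy` helper mirrors the evaluation described in the abstract, which counts how often a document's translation is ranked as its most similar document.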
