From Words to Corpora: Recognizing Translation

This paper presents a technique for discovering translationally equivalent texts. It consists of a matching algorithm applied at two different levels of analysis, together with a well-founded similarity score. The approach can be applied to any multilingual corpus using any kind of translation lexicon; it is therefore adaptable to varying levels of multilingual resource availability. Experimental results are reported on two tasks: a search for matching thirty-word segments in a corpus in which some segments are mutual translations, and classification of candidate pairs of web pages that may or may not be translations of each other. The latter results are competitive with those of previous approaches to the same problem that rely on document structure.
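As a rough illustration of the idea summarized above, the following minimal sketch applies one matching step at the word level (inside a pair of segments) and again at the segment level (across two sides of a corpus). The abstract does not specify the matching algorithm or the score, so everything here is an assumption made for illustration: the greedy one-to-one linking, the score (linked word pairs divided by linked pairs plus unlinked words), the threshold, and the names greedy_match, similarity, and match_segments are all hypothetical and are not the paper's definitions.

```python
from typing import List, Set, Tuple

Lexicon = Set[Tuple[str, str]]  # assumed format: (source word, target word) pairs


def greedy_match(src: List[str], tgt: List[str], lexicon: Lexicon) -> List[Tuple[int, int]]:
    """Word level: link each source token to at most one unused target token
    that the lexicon lists as its translation (illustrative greedy matching)."""
    used = set()
    links = []
    for i, s in enumerate(src):
        for j, t in enumerate(tgt):
            if j not in used and (s, t) in lexicon:
                used.add(j)
                links.append((i, j))
                break
    return links


def similarity(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Assumed score: linked pairs / (linked pairs + unlinked tokens)."""
    k = len(greedy_match(src, tgt, lexicon))
    total = len(src) + len(tgt) - k  # each link covers two tokens
    return k / total if total else 0.0


def match_segments(
    src_segs: List[List[str]],
    tgt_segs: List[List[str]],
    lexicon: Lexicon,
    threshold: float = 0.3,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Segment level: reuse the same greedy one-to-one idea, taking the
    highest-scoring segment pairs first and skipping already linked segments."""
    scored = sorted(
        (
            (similarity(s, t, lexicon), i, j)
            for i, s in enumerate(src_segs)
            for j, t in enumerate(tgt_segs)
        ),
        key=lambda x: (-x[0], x[1], x[2]),
    )
    used_src, used_tgt = set(), set()
    pairs = []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs score even lower
        if i not in used_src and j not in used_tgt:
            used_src.add(i)
            used_tgt.add(j)
            pairs.append((i, j, score))
    return pairs
```

Because the only language-specific input is the lexicon, a sketch like this can be run with whatever bilingual word list happens to be available, which is the adaptability the abstract points to.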