Cross-Language Similar Document Retrieval

Retrieving the translations of a document is very helpful for constructing bilingual parallel corpora. This paper proposes an improved approach for this purpose, which uses a statistical translation model to match bilingual word pairs, uses the weights of the word pairs as features for computing similarity, and uses a new Dice-based method to compute cross-language document similarity. The approach was evaluated by counting how often the translation of a given document was identified among the top N similar documents. Although two noisy datasets were used in the experiment, about 90% of the translations were identified among the top 5 similar documents. The experimental results show that the weights of bilingual word pairs are good features for similarity computation and that this approach can effectively find the translation equivalent of a document in another language.
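
As an illustration only, the following minimal sketch shows one way a Dice-style similarity over weighted word pairs and a top-N retrieval step could look. The abstract does not give the exact formula, so the scoring rule here (sum of the best pair weight per source word, doubled and divided by the total number of distinct words) is an assumption, as are the names weighted_dice, top_n, and the toy translation table.

```python
from typing import Dict, List, Tuple

# Hypothetical translation table: (source_word, target_word) -> pair weight.
# In practice such weights might come from a statistical translation model.
TranslationTable = Dict[Tuple[str, str], float]


def weighted_dice(src_tokens: List[str], tgt_tokens: List[str],
                  table: TranslationTable) -> float:
    """Dice-style score: 2 * matched weight / (|src words| + |tgt words|).

    Assumed scoring rule, not the paper's formula: each distinct source
    word contributes the weight of its best-matching target word.
    """
    src_set, tgt_set = set(src_tokens), set(tgt_tokens)
    if not src_set or not tgt_set:
        return 0.0
    matched = 0.0
    for s in src_set:
        matched += max(table.get((s, t), 0.0) for t in tgt_set)
    return 2.0 * matched / (len(src_set) + len(tgt_set))


def top_n(src_tokens: List[str], candidates: Dict[str, List[str]],
          table: TranslationTable, n: int = 5) -> List[str]:
    """Return the ids of the n candidate documents most similar to the source."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: weighted_dice(src_tokens, item[1], table),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n]]


if __name__ == "__main__":
    # Toy data, invented for this example.
    table = {("chat", "cat"): 0.9, ("chien", "dog"): 0.8}
    source = ["chat", "chien"]
    candidates = {
        "a": ["cat", "dog"],
        "b": ["tree", "house"],
        "c": ["cat", "bird"],
    }
    print(top_n(source, candidates, table, n=2))
```

Under this sketch, the evaluation described above would amount to checking, for each source document, whether the id of its known translation appears in the list returned by top_n.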