An Improvement in Cross-Language Document Retrieval Based on Statistical Models

This paper presents a proposed method integrated with three statistical models including Translation model, Query generation model and Document retrieval model for cross-language document retrieval. Given a certain document in the source language, it will be translated into the target language of statistical machine translation model. The query generation model then selects the most relevant words in the translated version of the document as a query. Finally, all the documents in the target language are scored by the document searching model, which mainly computes the similarities between query and document. This method can efficiently solve the problem of translation ambiguity and query expansion for disambiguation, which are critical in Cross-Language Information Retrieval. In addition, the proposed model has been extensively evaluated to the retrieval of documents that: 1) texts are long which, as a result, may cause the model to over generate the queries; and 2) texts are of similar contents under the same topic which is hard to be distinguished by the retrieval model. After comparing different strategies, the experimental results show a significant performance of the method with the average precision close to 100%. It is of a great significance to both cross-language searching on the Internet and the parallel corpus producing for statistical machine translation systems.

[1]  Martin Braschler,et al.  Experiments with the Eurospider Retrieval System for CLEF 2000 , 2000, CLEF.

[2]  Hermann Ney,et al.  Discriminative Training and Maximum Entropy Models for Statistical Machine Translation , 2002, ACL.

[3]  Gregory Grefenstette,et al.  Cross-Language Information Retrieval , 1998, The Springer International Series on Information Retrieval.

[4]  Mark W. Davis,et al.  QUILT: implementing a large-scale cross-language text retrieval system , 1997, SIGIR '97.

[5]  Vibhu O. Mittal,et al.  Bridging the lexical chasm: statistical approaches to answer-finding , 2000, SIGIR '00.

[6]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[7]  Juan Enrique Ramos,et al.  Using TF-IDF to Determine Word Relevance in Document Queries , 2003 .

[8]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[9]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[10]  W. Bruce Croft,et al.  Phrasal translation and query expansion techniques for cross-language information retrieval , 1997, SIGIR '97.

[11]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[12]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[13]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[14]  Felipe Sánchez-Martínez,et al.  DOCUMENT TRANSLATION RETRIEVAL BASED ON STATISTICAL MACHINE TRANSLATION TECHNIQUES , 2011, Appl. Artif. Intell..

[15]  Kazuaki Kishida,et al.  Technical issues of cross-language information retrieval: a review , 2005, Inf. Process. Manag..

[16]  Marcello Federico,et al.  Statistical cross-language information retrieval using n-best query translations , 2002, SIGIR '02.

[17]  Martin Franz,et al.  Ad hoc, Cross-language and Spoken Document Information Retrieval at IBM , 1999, Text Retrieval Conference.

[18]  W. Bruce Croft,et al.  Statistical Methods for Cross-Language Information Retrieval , 1998 .