Paper Information - Should MT Systems Be Used as Black Boxes in CLIR?

The translation stage in cross-language information retrieval (CLIR) is the main enabling step for crossing the language barrier between documents and queries. In recent years, machine translation (MT) systems have become the dominant approach to translation in CLIR. However, unlike information retrieval (IR), MT focuses on the morphological and syntactic quality of the output sentence, which requires large training resources and high computational power for both training and translation. We present a novel MT technique designed specifically for CLIR. In this method, IR text pre-processing, in the form of stop-word removal and stemming, is applied to the MT training corpus prior to the training phase. Applying this pre-processing step is found to significantly speed up the translation process without affecting retrieval quality.
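The pre-processing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the suffix-stripping `toy_stem` function, and the helper names `preprocess` and `preprocess_corpus` are all hypothetical stand-ins, and a real setup would presumably use a full stop-word list and a proper stemmer for each language before passing the corpus to an MT toolkit. Whether both sides of the parallel corpus are processed is also an assumption made here for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration);
# a real system would use a complete list for each language.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for"}

# Crude suffix stripper standing in for a real stemmer (assumption);
# longer suffixes are listed first so they are tried before shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ed", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return " ".join(toy_stem(t) for t in tokens if t not in STOP_WORDS)


def preprocess_corpus(pairs):
    """Apply IR-style pre-processing to both sides of a parallel corpus.

    `pairs` is a list of (source, target) sentence pairs. Pairs in which
    either side becomes empty after pre-processing are dropped, since they
    carry no content words to align.
    """
    processed = []
    for source, target in pairs:
        src, tgt = preprocess(source), preprocess(target)
        if src and tgt:
            processed.append((src, tgt))
    return processed


if __name__ == "__main__":
    corpus = [
        ("The retrieval of translated documents", "Translated queries"),
        ("of the", "and"),
    ]
    for src, tgt in preprocess_corpus(corpus):
        print(src, "|||", tgt)
```

Shorter sentences with a smaller vocabulary are one plausible reason why training and translation on such a corpus would be faster, although the sketch itself makes no claim about speed or retrieval quality.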

Walid Magdy | Gareth J. F. Jones
