Data driven methods for improving mono- and cross-lingual IR performance in noisy environments

In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, and rare proper nouns can be seen as noise when they appear in queries or in the target collection. Such vocabulary is often out-of-vocabulary (OOV) for the dictionaries used to translate queries. In historic document retrieval (HDR), OCR errors and historical spelling variants cause similar problems. In this paper, three data-driven approaches to these problems are presented. The first two methods, the transformation rule based translation (TRT) method and the classified s-gram method, operate at the string level. With them, approximate matches of a query word can be identified in the target document collection and included in the target query. In the third method, the corpus-based approach, parallel or comparable corpora are used to derive translation knowledge for translating OOV words. In addition to an overview of the methods, three case studies highlighting their practical applications in CLIR are presented. The methods are shown to be effective in query translation without dictionaries between closely related languages (TRT and s-grams), in OOV word translation (s-grams), and in boosting dictionary-based CLIR performance by way of OOV word translation (corpus-based methods).
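The string-level matching idea mentioned above can be illustrated with a small sketch. It is not the paper's implementation: the function names, the grouping of skip distances into the classes (0,) and (1, 2), and the use of a per-class Jaccard score averaged over classes are all assumptions chosen for illustration. The sketch forms character pairs separated by a given number of skipped characters (s-grams), compares a query word with candidate index terms class by class, and returns the closest candidates.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def s_grams(word: str, skips: Iterable[int]) -> Set[str]:
    """Return the set of character pairs whose members are `skip` characters apart."""
    grams: Set[str] = set()
    for skip in skips:
        step = skip + 1  # skip=0 gives ordinary adjacent digrams
        for i in range(len(word) - step):
            grams.add(word[i] + word[i + step])
    return grams


def s_gram_similarity(
    a: str,
    b: str,
    classes: Sequence[Tuple[int, ...]] = ((0,), (1, 2)),  # assumed grouping of skip distances
) -> float:
    """Average the Jaccard overlap of the two words' s-gram sets over all classes."""
    scores: List[float] = []
    for skips in classes:
        grams_a, grams_b = s_grams(a, skips), s_grams(b, skips)
        union = grams_a | grams_b
        if union:  # ignore classes where neither word yields any s-gram
            scores.append(len(grams_a & grams_b) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


def best_matches(query_word: str, index_terms: Iterable[str], top_n: int = 3) -> List[str]:
    """Rank index terms by similarity to the query word and keep the top `top_n`."""
    ranked = sorted(
        index_terms,
        key=lambda term: s_gram_similarity(query_word, term),
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical index terms; a spelling variant should rank first.
    print(best_matches("colour", ["color", "dolor", "table"], top_n=1))
```

In a retrieval setting, the top-ranked index terms would be added to the target query in place of, or alongside, the untranslatable query word.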
