Corpus-based CLIR in retrieval of highly relevant documents

The ability of IR systems to retrieve highly relevant documents has become increasingly important in the age of extremely large collections, such as the WWW. Our aim was to find out how well corpus-based cross-language information retrieval (CLIR) retrieves highly relevant documents. We created a Finnish-Swedish comparable corpus and used it as a source of knowledge for query translation. Finnish test queries were translated into Swedish and run against a Swedish test collection. The results were evaluated with graded relevance assessments at three relevance criterion levels (liberal, regular, and stringent). The runs were also evaluated with generalized recall and precision, which weight the retrieved documents according to their relevance level. The performance of our Comparable Corpus Translation system (Cocot) was compared to that of a dictionary-based query translation program; the two translation methods were also combined. The results indicate that corpus-based CLIR performs particularly well with highly relevant documents. In average precision, Cocot even matched the monolingual baseline on the highest rele-
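The two evaluation ideas mentioned in the abstract, relevance criterion levels and generalized recall and precision, can be illustrated with a minimal sketch. It assumes a graded scale of 0 to 3, linear weights (grade divided by the maximum grade), and thresholds of 1, 2, and 3 for the liberal, regular, and stringent levels; none of these values is stated in the abstract, and the function and variable names are hypothetical.

```python
from typing import Dict, List

# Assumed thresholds on a 0-3 graded scale (not taken from the abstract).
CRITERIA = {"liberal": 1, "regular": 2, "stringent": 3}


def generalized_precision(ranked: List[str], grades: Dict[str, int],
                          max_grade: int = 3) -> float:
    """Sum of relevance weights of retrieved documents / number retrieved."""
    if not ranked:
        return 0.0
    return sum(grades.get(d, 0) / max_grade for d in ranked) / len(ranked)


def generalized_recall(ranked: List[str], grades: Dict[str, int],
                       max_grade: int = 3) -> float:
    """Sum of relevance weights of retrieved documents / sum over all judged documents."""
    total = sum(g / max_grade for g in grades.values())
    if total == 0:
        return 0.0
    return sum(grades.get(d, 0) / max_grade for d in set(ranked)) / total


def binary_precision(ranked: List[str], grades: Dict[str, int],
                     criterion: str) -> float:
    """Ordinary precision after binarizing grades at the chosen criterion level."""
    if not ranked:
        return 0.0
    threshold = CRITERIA[criterion]
    hits = sum(1 for d in ranked if grades.get(d, 0) >= threshold)
    return hits / len(ranked)


# Toy judgments and a toy result list (illustrative data only).
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2}
ranked = ["d1", "d3", "d4"]

gp = generalized_precision(ranked, grades)
gr = generalized_recall(ranked, grades)
p_liberal = binary_precision(ranked, grades, "liberal")
p_stringent = binary_precision(ranked, grades, "stringent")
```

Under the stringent criterion only the grade-3 document counts as relevant, so the same result list scores lower than under the liberal criterion, while the generalized measures give partial credit to every retrieved document in proportion to its grade.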
