Effects of Aligned Corpus Quality and Size in Corpus-Based CLIR

Aligned corpora are widely used resources in cross-language information retrieval (CLIR) systems. The three qualities of translation corpora that most strongly affect the performance of a corpus-based CLIR system are (1) topical nearness to the translated queries, (2) the quality of the alignments, and (3) the size of the corpus. This paper studies and evaluates the effects of these factors. Topics from two domains (news and genomics) are translated with corpora of varying alignment quality, ranging from a clean parallel corpus to noisier comparable corpora. The sizes of the corpora are also varied. The results show that, of the three qualities, topical nearness is the most crucial, outweighing both of the others. This indicates that noisy comparable corpora should be used as complementary resources when parallel corpora are not available for the domain in question.
