A Probabilistic Translation Method for Dictionary-based Cross-lingual Information Retrieval in Agglutinative Languages

School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran.
{dadashkarimi,shakery,hfaili}@ut.ac.ir

Abstract. Translation ambiguity, out-of-vocabulary words, and missing translations in bilingual dictionaries make dictionary-based Cross-language Information Retrieval (CLIR) a challenging task. Moreover, in agglutinative languages that lack reliable stemmers, bilingual dictionaries miss many lexical formations, which degrades CLIR performance. This paper introduces a probabilistic translation model that addresses the ambiguity problem and also provides the most likely formations of a dictionary candidate. We propose the Minimum Edit Support Candidates (MESC) method, which exploits a monolingual corpus and a bilingual dictionary to translate users' native-language queries into the documents' language. Our experiments show that the proposed method outperforms state-of-the-art dictionary-based English-Persian CLIR.
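The abstract does not spell out how MESC selects formations, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm: for a dictionary candidate, collect the monolingual-corpus words within a small Levenshtein distance and weight them by relative corpus frequency. The function names, the `max_edits` threshold, the frequency-based scoring, and the toy English word counts are all assumptions made for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def support_candidates(candidate, corpus_counts, max_edits=2):
    """Hypothetical scoring: corpus words within max_edits of the candidate,
    normalized by their corpus frequency so the weights sum to 1."""
    near = {w: c for w, c in corpus_counts.items()
            if edit_distance(candidate, w) <= max_edits}
    total = sum(near.values())
    return {w: c / total for w, c in near.items()} if total else {}


# Toy monolingual corpus counts (hypothetical data).
corpus_counts = Counter({"book": 6, "books": 3, "bookish": 1, "table": 5})
formations = support_candidates("book", corpus_counts)
print(formations)
```

With these toy counts, only "book" and "books" fall within two edits of the candidate, so they share the probability mass in proportion to their frequencies. A real system would presumably combine such weights with the translation probabilities from the dictionary, which this sketch leaves out.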
