Statistical transliteration for english-arabic cross language information retrieval

Out of vocabulary (OOV) words are problematic for cross language information retrieval. One way to deal with OOV words when the two languages have different alphabets, is to transliterate the unknown words, that is, to render them in the orthography of the second language. In the present study, we present a simple statistical technique to train an English to Arabic transliteration model from pairs of names. We call this a selected n-gram model because a two-stage training procedure first learns which n-gram segments should be added to the unigram inventory for the source language, and then a second stage learns the translation model over this inventory. This technique requires no heuristics or linguistic knowledge of either language. We evaluate the statistically-trained model and a simpler hand-crafted model on a test set of named entities from the Arabic AFP corpus and demonstrate that they perform better than two online translation sources. We also explore the effectiveness of these systems on the TREC 2002 cross language IR task. We find that transliteration either of OOV named entities or of all OOV words is an effective approach for cross language IR.

[1]  What ’ s in a Name ? : Proper Names in Arabic Cross Language Information Retrieval , 2003 .

[2]  Mark W. Davis,et al.  Free Resources And Advanced Alignment For Cross-Language Text Retrieval , 1997, TREC.

[3]  Douglas W. Oard,et al.  TREC-10 Experiments at University of Maryland CLIR and Video , 2001, TREC.

[4]  James Allan,et al.  UMass at TREC 2002: Cross Language and Novelty Tracks , 2002, TREC.

[5]  Ari Pirkola,et al.  The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval , 1998, SIGIR '98.

[6]  Kevin Knight,et al.  Translating Names and Technical Terms in Arabic Text , 1998, SEMITIC@COLING.

[7]  W. Bruce Croft,et al.  Resolving ambiguity for cross-language retrieval , 1998, SIGIR '98.

[8]  Mansur Arbabi,et al.  Algorithms for Arabic name transliteration , 1994, IBM J. Res. Dev..

[9]  Tetsuya Ishikawa,et al.  Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration , 2001, Comput. Humanit..

[10]  Leah S. Larkey,et al.  Arabic Information Retrieval at UMass in TREC-10 , 2001, TREC.

[11]  F. Gutheim A world of cities , 1972 .

[12]  Kevin Knight,et al.  Machine Transliteration , 1997, CL.

[13]  Yaser Al-Onaizan,et al.  Machine Transliteration of Names in Arabic Texts , 2002, SEMITIC@ACL.

[14]  Hermann Ney,et al.  Improved Statistical Alignment Models , 2000, ACL.

[15]  Fredric C. Gey,et al.  The TREC-2001 Cross-Language Information Retrieval Track: Searching Arabic Using English, French or Arabic Queries , 2001, TREC.