Multilingual modeling of cross-lingual spelling variants

Technical term translations are important for cross-lingual information retrieval. In many languages, new technical terms share a common origin but are rendered with different spellings of the underlying sounds; such terms are known as cross-lingual spelling variants (CLSV). To find the best CLSV in a text database index, we formulate the problem in a probabilistic framework and implement it as an instance of the general edit distance using weighted finite-state transducers. Some training data is required to estimate the costs for the general edit distance. We demonstrate that, after some basic training, our new multilingual model is robust and requires little or no adaptation to cover additional languages, because the model takes advantage of language-independent transliteration patterns. We train the model with medical terms in seven languages and test it with terms from varied domains in six languages. Two of the test languages are not in the training data. Against a large text database index, we achieve 64–78% precision at the point of 100% recall. This is a relative improvement of 22% over the simple edit distance.
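To make the idea of a general edit distance with learned costs concrete, the following is a minimal sketch and not the implementation described above. It uses plain dynamic programming rather than weighted finite-state transducers, and it handles only single-character operations. The function names, the cost table, and the example terms are all hypothetical; in the work described here the costs are estimated from training data.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical substitution costs for illustration only; a trained model
# would estimate these from aligned term pairs.
EXAMPLE_SUB_COSTS: Dict[Tuple[str, str], float] = {
    ("c", "k"): 0.2,
    ("c", "s"): 0.3,
}


def weighted_edit_distance(
    source: str,
    target: str,
    sub_costs: Optional[Dict[Tuple[str, str], float]] = None,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    default_sub_cost: float = 1.0,
) -> float:
    """Edit distance where each substitution may carry its own cost."""
    sub_costs = sub_costs or {}
    n, m = len(source), len(target)
    # dist[i][j] is the cheapest way to turn source[:i] into target[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = source[i - 1], target[j - 1]
            # Identical characters are free; otherwise look up the pair cost.
            sub = 0.0 if a == b else sub_costs.get((a, b), default_sub_cost)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,   # delete a
                dist[i][j - 1] + ins_cost,   # insert b
                dist[i - 1][j - 1] + sub,    # substitute a -> b (or match)
            )
    return dist[n][m]


def best_variants(
    query: str,
    index_terms: List[str],
    sub_costs: Optional[Dict[Tuple[str, str], float]] = None,
    top_k: int = 3,
) -> List[str]:
    """Rank index terms by weighted distance to the query, cheapest first."""
    ranked = sorted(
        index_terms,
        key=lambda term: weighted_edit_distance(query, term, sub_costs),
    )
    return ranked[:top_k]
```

With every cost set to 1.0 this reduces to the simple edit distance used as the baseline. Lowering the cost of common transliteration patterns, such as c to k, is what lets a spelling variant outrank unrelated terms of similar length.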
