A statistical framework for query translation disambiguation

Resolving ambiguity in query translation is crucial to cross-language information retrieval (CLIR) because queries are short and offer little context. The problem is even more challenging when only a bilingual dictionary is available, which is the setting we address here. In this paper, we present a statistical framework for dictionary-based CLIR that estimates the translation probabilities of query words from monolingual word co-occurrence statistics. We also present two realizations of the framework, the “maximum coherence model” and the “spectral query-translation model,” which use different metrics to measure the coherence between a translation of a query word and the theme of the entire query. Compared to previous work on dictionary-based CLIR, the proposed framework has three advantages: (1) translation probabilities are calculated explicitly to capture the uncertainty in translating queries; (2) translations of all query words are estimated simultaneously rather than independently; and (3) the formulated problem can be solved efficiently and has a unique optimal solution. Empirical studies of Chinese-English cross-language information retrieval on TREC datasets show that the proposed models achieve a relative improvement of 10% to 50% over other approaches that also exploit word co-occurrence statistics for query translation disambiguation.
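
The general idea of scoring each candidate translation by its coherence with the translations of the other query words can be illustrated with a small sketch. This is not the maximum coherence model or the spectral query-translation model, neither of which is specified in this abstract; it is a simple fixed-point reweighting chosen only for illustration. The function name `estimate_translation_probs`, the `smoothing` parameter, the placeholder source words, and all co-occurrence values are hypothetical.

```python
from typing import Dict, List, Tuple


def estimate_translation_probs(
    candidates: Dict[str, List[str]],
    cooc: Dict[Tuple[str, str], float],
    iterations: int = 20,
    smoothing: float = 1e-3,
) -> Dict[str, Dict[str, float]]:
    """Illustrative sketch: soft translation probabilities from co-occurrence.

    candidates maps each source query word to its dictionary translations.
    cooc maps an (unordered) pair of target words to a co-occurrence score.
    """

    def sim(a: str, b: str) -> float:
        # Co-occurrence is treated as symmetric; missing pairs score 0.
        return cooc.get((a, b), cooc.get((b, a), 0.0))

    # Start from a uniform distribution over each word's translations.
    probs = {w: {t: 1.0 / len(ts) for t in ts} for w, ts in candidates.items()}

    for _ in range(iterations):
        updated = {}
        for w, ts in candidates.items():
            scores = {}
            for t in ts:
                # Expected co-occurrence of t with the current (soft)
                # translations of all *other* query words.
                coherence = sum(
                    probs[v][u] * sim(t, u)
                    for v, us in candidates.items()
                    if v != w
                    for u in us
                )
                # Smoothing keeps every candidate at non-zero probability.
                scores[t] = coherence + smoothing
            total = sum(scores.values())
            updated[w] = {t: s / total for t, s in scores.items()}
        # All words are updated together from the previous iterate.
        probs = updated
    return probs


# Hypothetical two-word query with made-up co-occurrence scores.
example_candidates = {
    "src_word_1": ["bank", "shore"],
    "src_word_2": ["loan", "borrow"],
}
example_cooc = {
    ("bank", "loan"): 0.8,
    ("bank", "borrow"): 0.5,
    ("shore", "loan"): 0.05,
    ("shore", "borrow"): 0.05,
}
example_probs = estimate_translation_probs(example_candidates, example_cooc)
```

In this toy query, "bank" ends up with more probability mass than "shore" because it co-occurs more strongly with both candidate translations of the other word. The sketch keeps a full distribution per query word rather than picking a single best translation, which mirrors the abstract's point about representing translation uncertainty explicitly.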
