Paper information - A statistical framework for query translation disambiguation

A statistical framework for query translation disambiguation

Resolving ambiguity during query translation is crucial to cross-language information retrieval (CLIR) because queries are short and offer little context. The problem is even more challenging when only a bilingual dictionary is available, which is the focus of the work described here. In this paper, we present a statistical framework for dictionary-based CLIR that estimates the translation probabilities of query words from monolingual word co-occurrence statistics. We also present two realizations of the proposed framework, the “maximum coherence model” and the “spectral query-translation model,” which use different metrics to measure the coherence between a translation of a query word and the theme of the entire query. Compared to previous work on dictionary-based CLIR, the proposed framework is advantageous in three respects: (1) translation probabilities are calculated explicitly to capture the uncertainty in translating queries; (2) translations of all query words are estimated simultaneously rather than independently; and (3) the formulated problem can be solved efficiently with a unique optimal solution. Empirical studies of Chinese-English cross-language information retrieval on TREC datasets show that the proposed models achieve a relative improvement of 10% to 50% over other approaches that also exploit word co-occurrence statistics for query translation disambiguation.
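The general idea of scoring dictionary translations by their coherence with the rest of the query can be illustrated with a small sketch. This is not the paper's maximum coherence model or spectral query-translation model, whose formulations are not given in the abstract. It is a simple iterative reweighting heuristic written only for illustration. The function name `estimate_translation_probs`, the romanized query words, the candidate translations, and the co-occurrence scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def estimate_translation_probs(
    candidates: Dict[str, List[str]],
    cooc: Dict[Tuple[str, str], float],
    iterations: int = 10,
) -> Dict[str, Dict[str, float]]:
    """Toy estimate of translation probabilities from co-occurrence scores.

    Illustrative heuristic only (an assumption, not the paper's method):
    each translation is scored by its expected co-occurrence with the
    current translations of the other query words, then renormalized.
    """
    # Start from a uniform distribution over each word's dictionary entries.
    probs = {w: {t: 1.0 / len(ts) for t in ts} for w, ts in candidates.items()}

    def sim(a: str, b: str) -> float:
        # Co-occurrence is treated as symmetric; unseen pairs score zero.
        return cooc.get((a, b), cooc.get((b, a), 0.0))

    for _ in range(iterations):
        new_probs = {}
        for w, ts in candidates.items():
            scores = {}
            for t in ts:
                coherence = sum(
                    p * sim(t, t2)
                    for w2, dist in probs.items()
                    if w2 != w
                    for t2, p in dist.items()
                )
                scores[t] = coherence + 1e-9  # smoothing avoids a zero total
            total = sum(scores.values())
            new_probs[w] = {t: s / total for t, s in scores.items()}
        # All query words are updated together from the previous estimates.
        probs = new_probs
    return probs


# Hypothetical two-word query with made-up dictionary entries and scores.
candidates = {"yinhang": ["bank", "shore"], "daikuan": ["loan", "credit"]}
cooc = {
    ("bank", "loan"): 0.8,
    ("bank", "credit"): 0.6,
    ("shore", "loan"): 0.05,
    ("shore", "credit"): 0.02,
}
probs = estimate_translation_probs(candidates, cooc)
```

With these made-up scores, "bank" ends up with a higher probability than "shore" for the first query word, because it co-occurs more strongly with both candidate translations of the second word. Unlike the framework described in the abstract, this heuristic offers no guarantee of a unique optimal solution.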
