Translating unknown cross-lingual queries in digital libraries using a Web-based approach

Users' cross-lingual queries to a digital library system might be short and not included in a common translation dictionary (unknown terms). In this paper, we investigate the feasibility of exploiting the Web as the corpus source to translate unknown query terms for cross-language information retrieval (CLIR) in digital libraries. We propose a Web-based term translation approach to determine effective translations for unknown query terms by mining bilingual search-result pages obtained from a real Web search engine. This approach can enhance the construction of a domain-specific bilingual lexicon and benefit CLIR services in a digital library that only has monolingual document collections. Very promising results have been obtained in generating effective translation equivalents for many unknown terms, including proper nouns, technical terms and Web query terms.

[1]  Kenneth Ward Church,et al.  Identifying word correspondence in parallel texts , 1991 .

[2]  Lee-Feng Chien,et al.  PAT-tree-based keyword extraction for Chinese information retrieval , 1997, SIGIR '97.

[3]  Donald R. Morrison,et al.  PATRICIA—Practical Algorithm To Retrieve Information Coded in Alphanumeric , 1968, J. ACM.

[4]  Kui-Lam Kwok NTCIR-2 Chinese, Cross Language Retrieval Experiments Using PIRCS , 2001, NTCIR.

[5]  Fredric C. Gey,et al.  Combining multiple sources for short query translation in Chinese-English cross-language information retrieval , 2000, IRAL '00.

[6]  W. Bruce Croft,et al.  Cross-lingual relevance models , 2002, SIGIR '02.

[7]  José Gabriel Pereira Lopes,et al.  Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units , 1999, EPIA.

[8]  Pascale Fung,et al.  An IR Approach for Translating New Words from Nonparallel, Comparable Texts , 1998, ACL.

[9]  Vasileios Hatzivassiloglou,et al.  Translating Collocations for Bilingual Lexicons: A Statistical Approach , 1996, CL.

[10]  Hsi-Jian Lee,et al.  Translation of web queries using anchor text mining , 2002, TALIP.

[11]  Jian-Yun Nie,et al.  Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web , 1999, SIGIR '99.

[12]  Kenneth Ward Church,et al.  A Program for Aligning Sentences in Bilingual Corpora , 1993, CL.

[13]  Ricardo A. Baeza-Yates,et al.  An Algorithm for String Matching with a Sequence of don't Cares , 1991, Inf. Process. Lett..

[14]  Kenneth Ward Church,et al.  Identifying Word Correspondences in Parallel Texts , 1991, HLT.

[15]  Gaston H. Gonnet,et al.  New Indices for Text: Pat Trees and Pat Arrays , 1992, Information Retrieval: Data Structures & Algorithms.

[16]  Hsi-Jian Lee,et al.  Anchor text mining for translation of Web queries: A transitive translation approach , 2004, TOIS.

[17]  Reinhard Rapp,et al.  Automatic Identification of Word Translations from Unrelated English and German Corpora , 1999, ACL.

[18]  Fredric C. Gey,et al.  Harvesting translingual vocabulary mappings for multilingual digital libraries , 2002, JCDL '02.