Cross-Language Information Retrieval

Search for information is no longer limited to the user's native language; it increasingly extends to other languages. This gives rise to the problem of cross-language information retrieval (CLIR), whose goal is to find relevant information written in a language different from that of the query. In addition to the problems of monolingual information retrieval (IR), translation is the key problem in CLIR: one must translate either the query or the documents from one language to another. However, this translation problem is not identical to full-text machine translation (MT): the goal is not to produce a human-readable translation, but a translation suitable for finding relevant documents. Specific translation methods are thus required. The goal of this book is to provide a comprehensive description of the specific problems arising in CLIR, the solutions proposed in this area, and the problems that remain. The book starts with a general description of the monolingual IR and CLIR problems. Different classes of approaches to translation are then presented: approaches using an MT system, dictionary-based translation, and approaches based on parallel and comparable corpora. In addition, the typical retrieval effectiveness of the different approaches is compared. It will be shown that translation approaches specifically designed for CLIR can rival and even outperform high-quality MT systems. Finally, the book offers a look into the future that draws a strong parallel between query expansion in monolingual IR and query translation in CLIR, suggesting that many approaches developed in monolingual IR can be adapted to CLIR. The book can be used as an introduction to CLIR. Advanced readers can also find more technical details and discussions of the remaining research challenges. It is suitable for new researchers who intend to carry out research on CLIR.
