Translation events in cross-language information retrieval: lexical ambiguity, lexical holes, vocabulary mismatch, and correct translations

Cross-Language Information Retrieval (CLIR) systems enable users to formulate queries in their native language to retrieve documents in foreign languages. Because queries and documents in CLIR do not necessarily share the same language, translation is needed before matching can take place. This translation step tends to reduce the retrieval performance of CLIR compared with monolingual information retrieval. Query translation is the prevailing CLIR approach and the focus of this study. Translating queries is inherently difficult because there is no one-to-one mapping between a lexical item and its meaning, which creates lexical ambiguity. This and other translation problems result in translation errors that affect CLIR performance. To understand the events that occur in cross-language query translation and how they relate to retrieval performance, the study explored the following research questions: (1) What kinds of translation events affect cross-language retrieval? (2) In what way does the presence of certain translation events in query translation affect retrieval performance? The study followed a two-phase, multi-method approach. In phase one, a taxonomy of translation events was created through content analysis of queries and their translations, combined with an examination of the literature. In phase two, a subset of the test queries was coded using the taxonomy from phase one. These queries were then used in information retrieval experiments to assess the impact of the translation events on retrieval performance.
