Translation disambiguation in web-based translation extraction for English-Chinese CLIR

Dictionary-based translation is a traditional approach used by cross-language information retrieval (CLIR) systems. However, significant performance degradation is often observed when queries contain words that do not appear in the dictionary. This is known as the out-of-vocabulary (OOV) problem. In recent years, web-based translation extraction has been shown to be one of the more effective approaches to this problem. Previous work has focused on selecting the correct translation from a set of web-extracted terms. Common methods for selecting among web-extracted translations rely on word frequency calculations, but the results are not always satisfactory. In this paper we present an approach that selects terms more accurately. Our experiments show an improvement in translation accuracy over other commonly used approaches.
