Studying the effect and treatment of misspelled queries in Cross-Language Information Retrieval

We study the effects of misspelled queries on the performance of CLIR systems. Word-based approaches (as both indexing and translation units) are highly sensitive to the presence of misspellings. The use of correction mechanisms can significantly reduce their negative effects. Classical techniques are suitable for shorter queries, while context-based correction is suitable for longer queries. Our approach based on character n-grams (as both indexing and translation units) shows remarkable robustness.

In contrast with monolingual retrieval, little attention has been paid to the effects that misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval to study the impact that the progressive addition of misspellings to input queries has on the output of CLIR systems. Two approaches for dealing with this problem are analyzed. The first is the use of automatic spelling correction techniques, for which we consider two algorithms: one for the correction of isolated words and another that exploits the linguistic context of the misspelled word. The second is the use of character n-grams as both index terms and translation units, seeking to take advantage of their inherent robustness and language independence. All these approaches have been tested on a Spanish-to-English CLIR system, that is, Spanish queries run against English documents. Real, user-generated spelling errors have been used under a methodology that allows us to study the effectiveness of the different approaches and their behavior under different error rates. The results show the great sensitivity of classic word-based approaches to misspelled queries, although spelling correction techniques can mitigate these negative effects. On the other hand, the use of character n-grams provides great robustness against misspellings.
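To make the two families of techniques concrete, the sketch below contrasts character n-gram matching, which degrades gracefully when a query term is misspelled, with a classic isolated-word corrector that maps the misspelled term back to the closest lexicon entry by edit distance. This is only a minimal illustration, not the system evaluated in the paper: the 4-gram size, the boundary padding with "_", the Dice overlap measure, the toy Spanish lexicon, and the example misspelling are all assumptions made here for exposition, and the context-sensitive corrector is omitted entirely.

```python
# Minimal sketch (not the authors' implementation): character n-gram
# tokenization vs. a Levenshtein-based isolated-word corrector.
# All parameter choices below (n=4, "_" padding, toy lexicon) are
# illustrative assumptions, not taken from the paper.

def char_ngrams(word, n=4):
    """Split a word into overlapping character n-grams."""
    padded = f"_{word}_"                      # mark word boundaries
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_overlap(a, b, n=4):
    """Dice coefficient between the n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def correct_isolated(word, lexicon):
    """Return the lexicon entry closest to the input word."""
    return min(lexicon, key=lambda w: levenshtein(word, w))

if __name__ == "__main__":
    # A misspelled Spanish term still shares most of its n-grams with
    # the correct form, so n-gram indexing degrades gracefully.
    print(ngram_overlap("desarrollo", "desarollo"))   # ~0.71 overlap despite the typo
    # An isolated-word corrector instead maps the typo back to the lexicon.
    lexicon = ["desarrollo", "desarreglo", "descarrilo"]
    print(correct_isolated("desarollo", lexicon))     # -> "desarrollo"
```

The two strategies trade off differently: correction must commit to a single lexicon entry before retrieval (and can benefit from context for that decision), whereas n-gram indexing never commits, simply letting partial matches contribute to the ranking.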
