JHU/APL Experiments in Tokenization and Non-Word Translation

In previous work we investigated the benefits and peculiarities of alternative tokenization methods, particularly overlapping character n-grams. This year we continued this line of work and report new findings reaffirming that the judicious use of n-grams can yield performance surpassing that of word-based tokenization. In particular we examined: the relative performance of n-grams and a popular suffix stemmer; a novel form of n-gram indexing that approximates stemming and achieves fast run-time performance; n-grams of various lengths; and the use of n-grams for robust translation of queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted monolingual and bilingual runs for all languages and language pairs, and multilingual runs using English as the source language. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular stemmer in non-Romance languages, that direct translation of n-grams is feasible using an aligned corpus, that translated 5-grams yield performance superior to that of words, stems, or 4-grams, and that a combination of indexing methods is best of all.
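To make "overlapping character n-grams" concrete, the following is a minimal sketch of one way such tokens could be produced. It is illustrative only and not the authors' implementation: the function name `char_ngrams`, the lowercasing step, and the use of `_` as a visible stand-in for whitespace and text boundaries are all assumptions made for this example.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`.

    Assumptions (not taken from the paper): text is lowercased, runs of
    whitespace collapse to a single "_" marker, and one "_" is added at
    each end so that word-initial and word-final n-grams are distinct.
    """
    normalized = "_".join(text.lower().split())
    padded = "_" + normalized + "_"
    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("juggling", n=4))
    # ['_jug', 'jugg', 'uggl', 'ggli', 'glin', 'ling', 'ing_']
```

Because the window moves one character at a time, morphological variants such as "juggling" and "juggler" share several tokens (here `_jug`, `jugg`, `uggl`), which is one intuition for why n-grams can stand in for stemming.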
