Monolingual Retrieval for European Languages

Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and analyze their impact on retrieval effectiveness. The techniques considered range from linguistically motivated methods, such as morphological normalization and compound splitting, to knowledge-free approaches, such as n-gram indexing. Evaluations are carried out against data from the CLEF campaign, covering eight European languages. Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language-independent techniques. In our experiments, the best combination of settings proved to be highly language-dependent.
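As an illustration of the knowledge-free end of this range, the following is a minimal sketch of character n-gram tokenization. It is not the implementation evaluated here: the function name `char_ngrams`, the default length of 4, the lowercasing, and the choice to keep n-grams within word boundaries are all assumptions made for the example.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams, one token at a time.

    Assumptions for this sketch: tokens are whitespace-delimited, text is
    lowercased, n-grams do not span word boundaries, and tokens no longer
    than n characters are kept whole.
    """
    grams = []
    for token in text.lower().split():
        if len(token) <= n:
            # Short tokens are indexed as-is.
            grams.append(token)
        else:
            # Slide a window of width n over the token.
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


if __name__ == "__main__":
    # A German compound is broken into overlapping substrings without
    # any lexicon or morphological analysis.
    print(char_ngrams("Fahrradweg"))
```

Because the substrings overlap, a query term that matches only one part of a compound can still share n-grams with it, which is why such indexing needs no language-specific resources.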
