Monolingual Document Retrieval for European Languages

Recent years have witnessed considerable advances in information retrieval for European languages other than English. We give an overview of commonly used techniques and analyze their impact on retrieval effectiveness. The techniques range from linguistically motivated ones, such as morphological normalization and compound splitting, to knowledge-free approaches, such as n-gram indexing. Evaluations are carried out on data from the CLEF campaign, covering eight European languages. Our results show that for many of these languages a modicum of linguistic techniques may lead to improvements in retrieval effectiveness, as can the use of language-independent techniques.
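
As an illustration of the knowledge-free end of this range, the following is a minimal sketch of character n-gram tokenization in Python. The function name `char_ngrams`, the default length of 4, and the choice to keep n-grams inside word boundaries are assumptions made for illustration; they are not necessarily the configuration evaluated here.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Split text into overlapping character n-grams, one word at a time.

    Hypothetical helper: words are lowercased and split on whitespace;
    words no longer than n characters are kept whole.
    """
    grams: list[str] = []
    for token in text.lower().split():
        if len(token) <= n:
            # Short words cannot be split further, so index them as-is.
            grams.append(token)
        else:
            # Slide a window of width n across the word.
            grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("retrieval"))
```

Because the n-grams overlap, morphological variants of a word share most of their index terms without any language-specific resources, which is why such an approach can be applied unchanged across languages.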
