Stemming and n-grams in Spanish: an evaluation of their impact on information retrieval

Most information retrieval models and techniques rely, at some stage, on frequency counts of the terms that appear in documents and queries. Many words, however, derive from the same stem and carry very similar semantic content, so it is advisable to group such variants under a single term. Otherwise the frequency counts are dispersed across the variants, and queries and documents become harder to compare. Languages also differ notably in how they form derivatives and inflected forms, so a given technique can produce unequal results depending on the language of the documents and queries. This paper describes tests carried out on documents in Spanish, using several stemming techniques widely used for English as well as n-grams, and compares the results.
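The two families of techniques named above can be illustrated with a minimal Python sketch. It is not the procedure evaluated in the paper: the suffix list, the function names (`naive_stem`, `char_ngrams`, `dice`), the padding convention, and the choice of trigrams with a Dice coefficient are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy suffix list; a real Spanish stemmer needs far more rules.
SUFFIXES = ("es", "s")

def naive_stem(word):
    """Strip one plural suffix, keeping a stem of at least 3 characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a space-padded word."""
    padded = f" {word.lower()} "
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

def dice(a, b, n=3):
    """Dice coefficient over the sets of character n-grams of two words."""
    x, y = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

words = ["libro", "libros", "papel", "papeles"]

# Without conflation, each variant is counted as a separate term.
raw_counts = Counter(words)

# With stemming, variants are grouped under a single term.
stem_counts = Counter(naive_stem(w) for w in words)

# With n-grams, variants are not merged but share most of their n-grams.
related = dice("libro", "libros")
unrelated = dice("libro", "papel")
```

In this toy input the raw counts spread over four distinct terms, while the stemmed counts collapse to two terms with frequency 2 each. The n-gram route applies no language-specific rules; it depends only on the overlap between character sequences.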
