Stemming Galician Texts

In this paper we describe a stemming algorithm for the Galician language that simultaneously supports the four orthographic regulations currently in use for Galician. The algorithm has already been implemented, and we have begun using it in order to refine it. However, the algorithm cannot be applied to documents that predate the first Galician orthographic regulation, which appeared in 1977. To stem documents written before that date, we have therefore adopted an exhaustive approach, which consists of defining a large collection of wordsets that allow systematic word comparisons. We also describe a tool for building the wordsets that this approach requires.
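As a rough illustration of the exhaustive approach, the sketch below assumes that a wordset is simply a group of word forms that should be treated as equivalent, keyed by a shared representative. The abstract does not specify the wordset format, so the data layout, the names `WORDSETS`, `build_index`, and `stem_tokens`, and the sample entries are all hypothetical and are not taken from the described implementation.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical wordsets: representative -> word forms grouped under it.
# The entries are illustrative only, not data from the paper.
WORDSETS: Dict[str, Set[str]] = {
    "cas": {"casa", "casas"},
    "libr": {"libro", "libros"},
}


def build_index(wordsets: Dict[str, Set[str]]) -> Dict[str, str]:
    """Invert the wordsets into a form -> representative lookup table."""
    index: Dict[str, str] = {}
    for representative, forms in wordsets.items():
        for form in forms:
            index[form.lower()] = representative
    return index


def stem_tokens(tokens: Iterable[str], index: Dict[str, str]) -> List[str]:
    """Replace each token by its wordset representative.

    Tokens that appear in no wordset are returned lowercased and unchanged.
    """
    return [index.get(token.lower(), token.lower()) for token in tokens]


INDEX = build_index(WORDSETS)
STEMMED = stem_tokens(["Casas", "libro", "mar"], INDEX)
```

Because every comparison is a dictionary lookup, coverage depends entirely on how complete the wordsets are, which is presumably why a dedicated tool for building them is needed.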
