The Importance of Morphological Normalization for XML Retrieval

Current information retrieval systems typically ignore the structural aspects of documents and focus solely on their textual content. But documents containing additional structure in the form of HTML, XML, or SGML mark-up are pervasive on the Internet. The XML retrieval task presents a number of challenges for information retrieval, because we can no longer assume that the appropriate unit of retrieval is fixed or known beforehand. This implies that the effectiveness of standard IR techniques, such as morphological normalization methods, may not carry over to this particular task. This paper describes the fully automatic runs for the INEX 2002 task submitted by the Language and Inference Technology Group at the University of Amsterdam. We investigate the effectiveness of two standard approaches to morphological normalization: a linguistically motivated stemming algorithm and a knowledge-poor character n-gramming technique. Our results show that morphological normalization is an important issue for XML retrieval. For all measurements, the combined run and the n-gram run perform better than the stemmed run.
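
To make the knowledge-poor approach concrete, the following is a minimal sketch of character n-gramming as it is commonly done, not the implementation used for the submitted runs. The function names, the n-gram length of 4, the lowercasing, the whitespace tokenization, and the choice to keep n-grams within word boundaries are all assumptions made for illustration; the abstract does not specify them.

```python
def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a single word.

    Assumption: words no longer than n are kept whole, and n-grams
    never span word boundaries.
    """
    word = word.lower()
    if len(word) <= n:
        return [word]
    # Slide a window of width n across the word, one character at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_tokens(text: str, n: int = 4) -> list[str]:
    """Replace every whitespace-separated word in text by its n-grams."""
    tokens: list[str] = []
    for word in text.split():
        tokens.extend(char_ngrams(word, n))
    return tokens


if __name__ == "__main__":
    print(ngram_tokens("XML retrieval"))
```

Under these assumptions, morphological variants such as "retrieval" and "retrieving" share index terms like "retr" and "riev", so they can match each other without any language-specific stemming rules.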
