The Importance of Length Normalization for XML Retrieval

XML retrieval departs from standard document retrieval in that each individual XML element, ranging from an italicized word or phrase to a full-blown article, is a retrievable unit. The distribution of XML element lengths is unlike what is usually observed in standard document collections, which prompts us to revisit the issue of document length normalization. We perform a comparative analysis of arbitrary elements versus relevant elements, and show the importance of element length as a parameter for XML retrieval. Within the language modeling framework, we investigate a range of techniques that deal with length either directly or indirectly. We observe a length bias introduced by the amount of smoothing, and show the importance of an extreme length bias for XML retrieval. We also show that simply removing shorter elements from the index (by introducing a cut-off value) does not yield an appropriate element length normalization. Even after restricting the minimal size of XML elements occurring in the index, the importance of an extreme explicit length bias remains.
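
The three length-related knobs named above (amount of smoothing, explicit length bias, and index cut-off) can be illustrated with a small sketch. This is not the system evaluated here: it assumes a query-likelihood model with linear interpolation (Jelinek-Mercer) smoothing and an unnormalized length prior proportional to the element length raised to a power. The function name `score_element` and the parameters `lam`, `beta`, and `min_len` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def score_element(query, element, collection, lam=0.5, beta=1.0, min_len=0):
    """Log score of one element for a query (illustrative sketch only).

    query, element, collection: lists of tokens.
    lam:     weight on the element model; (1 - lam) goes to the collection
             model, so a smaller lam means more smoothing (assumed form).
    beta:    exponent of the assumed length prior |e| ** beta;
             beta = 0 disables the explicit length bias.
    min_len: index cut-off; shorter elements are not retrievable.
    """
    if not element or len(element) < min_len:
        return None  # element is excluded from the index by the cut-off

    tf_element = Counter(element)
    tf_collection = Counter(collection)

    # Explicit length bias: log of the unnormalized prior |e| ** beta.
    log_score = beta * math.log(len(element))

    for term in query:
        p_element = tf_element[term] / len(element)
        p_collection = tf_collection[term] / len(collection)
        p = lam * p_element + (1.0 - lam) * p_collection
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        log_score += math.log(p)

    return log_score


# A one-word element and a ten-word element that both contain the query term.
short = ["xml"]
long_el = ["xml", "retrieval", "treats", "each", "element",
           "as", "a", "retrievable", "unit", "here"]
collection = short + long_el
query = ["xml"]

no_bias = (score_element(query, short, collection, beta=0.0),
           score_element(query, long_el, collection, beta=0.0))
strong_bias = (score_element(query, short, collection, beta=2.0),
               score_element(query, long_el, collection, beta=2.0))
```

In this toy setup the one-word element outscores the longer one when `beta` is 0, and the ranking flips once a strong length prior is applied. Setting `min_len` only removes the shortest elements from consideration; it does not change how the remaining elements are ranked relative to each other, which is consistent with the observation that a cut-off alone is not a substitute for an explicit length bias.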
