Flexible document-query matching based on a probabilistic content and structure score combination

The goal of an XML retrieval system is to select, from a set of XML documents, all elements (nodes) that fit the user's information need, which is usually expressed as a set of keywords with some structural conditions. A structural condition is simply an ordered list of tag names that identifies the target element in which to search for relevant content. Consequently, a potentially relevant node should not only contain text similar to the query; its location path should also fit the structural conditions. In this paper we describe a new approach for ranking the results of XML content-and-structure queries, based on a probabilistic combination of two independent scores assigned to each XML element: a content score and a structural score. The content score measures the content similarity between an element and a query, and the structural score measures the similarity between an element's path and the structural conditions of the query. We show experimentally that both scores follow well-known distributions, and we then propose a probabilistic combination of these distributions to assign a final score to each node. Experiments on a dataset provided by INEX show the effectiveness of our approach. Our experiments focus on the VVCAS task, which is well suited to our model.
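
As a rough illustration of the two-score idea, the Python sketch below scores an element path against a query's ordered tag list and then combines a content score with that structural score. Everything specific in it is an assumption made for illustration, not the method of the paper: the abstract names neither the path-similarity measure nor the distributions, so the sketch uses a longest-common-subsequence ratio as the structural score, and empirical cumulative distribution functions (multiplied together under the stated independence assumption) as a stand-in for the fitted distributions. The function names and sample values are hypothetical.

```python
from bisect import bisect_right


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag-name lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_score(element_path, query_path):
    """Fraction of the query's tag names matched, in order, by the element path.

    Assumption: an LCS ratio stands in for the paper's path similarity.
    """
    if not query_path:
        return 1.0  # no structural condition: every path fits
    return lcs_length(element_path, query_path) / len(query_path)


def empirical_cdf(sample):
    """Return F(x) = share of sample values <= x (stand-in for a fitted CDF)."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def combined_score(content, structure, content_cdf, structure_cdf):
    """Product of the two CDF values, treating the scores as independent.

    Assumption: this product is one possible probabilistic combination.
    """
    return content_cdf(content) * structure_cdf(structure)


# Hypothetical element path and query structural conditions.
path = ["article", "bdy", "sec", "p"]
s_full = structural_score(path, ["article", "sec"])   # both tags matched in order
s_half = structural_score(path, ["article", "fig"])   # only "article" matched

# Hypothetical score samples collected over candidate elements.
content_cdf = empirical_cdf([0.1, 0.2, 0.4, 0.8])
structure_cdf = empirical_cdf([0.0, 0.5, 0.5, 1.0])

final_full = combined_score(0.4, s_full, content_cdf, structure_cdf)
final_half = combined_score(0.4, s_half, content_cdf, structure_cdf)
```

With equal content scores, the element whose path matches more of the structural condition ends up with the higher or equal final score, which is the behaviour the combination is meant to capture.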
