Paper information - Flexible document-query matching based on a probabilistic content and structure score combination

The goal of an XML retrieval system is to select, from a set of XML documents, all elements (nodes) that fit the user's information need, which is usually expressed as a set of keywords with some structural conditions. The structural conditions are given by an ordered list of tag names that identifies the target element in which to search for relevant content. Consequently, a potentially relevant node should not only contain text similar to the query; its location path should also fit the structural conditions. In this paper we describe a new approach for ranking answers to XML content-and-structure queries, based on a probabilistic combination of two independent scores assigned to each XML element: a content score and a structural score. The content score measures the content similarity between an element and a query; the structural score measures the similarity between an element's path and the structural conditions of the query. We show experimentally that both scores follow well-known distributions, and we then propose a probabilistic combination of these distributions to assign a final score to each node. Experiments were carried out on a dataset provided by INEX to show the effectiveness of our approach. These experiments focus on the VVCAS task, which is well suited to our model.
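To make the two-score idea concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not say which similarity measures, which distributions, or which combination rule the paper uses. Everything in the sketch is an assumption chosen for illustration: the keyword-overlap content score, the longest-common-subsequence path score, the empirical CDFs that stand in for the fitted "well-known distributions", the product used to combine them, and all function names.

```python
from bisect import bisect_right


def lcs_length(a, b):
    """Length of the longest common subsequence of two tag-name lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_score(element_path, query_path):
    """Assumed measure: share of the query's ordered tags matched, in order."""
    if not query_path:
        return 1.0
    return lcs_length(element_path, query_path) / len(query_path)


def content_score(element_text, keywords):
    """Assumed toy measure: share of query keywords present in the text."""
    if not keywords:
        return 0.0
    words = set(element_text.lower().split())
    return sum(k.lower() in words for k in keywords) / len(keywords)


def empirical_cdf(samples):
    """Return F(x) = fraction of samples <= x (stand-in for a fitted CDF)."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def rank(elements, keywords, query_path):
    """Score each element by the product of its two CDF values (assumption)."""
    cs = [content_score(e["text"], keywords) for e in elements]
    ss = [structural_score(e["path"], query_path) for e in elements]
    f_c, f_s = empirical_cdf(cs), empirical_cdf(ss)
    scored = [(f_c(c) * f_s(s), e["path"]) for e, c, s in zip(elements, cs, ss)]
    return sorted(scored, key=lambda t: t[0], reverse=True)
```

Calling `rank(elements, ["xml", "retrieval"], ["sec", "p"])` on a list of dictionaries with `"path"` (a list of tag names from the root) and `"text"` keys returns `(score, path)` pairs, best first. Mapping each raw score through a distribution before combining puts the two scores on a common probability scale, which is one plausible reading of a "probabilistic combination".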

Mohand Boughanem | Mohamed Benaouicha | Mohamed Tmar
