Top-k keyword search over probabilistic XML data

Despite the proliferation of work on XML keyword queries, supporting keyword queries over probabilistic XML data remains an open problem. Compared with traditional keyword search, answering a keyword query over probabilistic XML data is far more expensive because possible-world semantics must be taken into account. In this paper, we first define the new problem of top-k keyword search over probabilistic XML data: retrieving the k SLCA (smallest lowest common ancestor) results with the k highest probabilities of existence. We then propose two efficient algorithms. The first, PrStack, finds the k SLCA results with the k highest probabilities by scanning the relevant keyword nodes only once. To further improve efficiency, the second, EagerTopK, relies on a set of pruning properties that quickly discard unsatisfied SLCA candidates. Finally, we implement both algorithms and compare their performance through an analysis of extensive experimental results.
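To make the problem statement concrete, the following is a minimal brute-force sketch, not PrStack or EagerTopK. It assumes a simple model in which each node exists independently with a given probability whenever its parent exists; the document `DOC`, the function `topk_slca`, and all node names are hypothetical and chosen only for illustration. The sketch enumerates every combination of keep/drop decisions, derives the resulting possible world, finds its SLCA nodes, and accumulates each node's probability of being an SLCA result.

```python
from collections import defaultdict
from heapq import nlargest
from itertools import product
from math import prod

# Hypothetical toy document (an assumption, not from the paper):
# node id -> (parent id, P(node exists | parent exists), keywords at node).
# Parents must be listed before their children.
DOC = {
    "r": (None, 1.0, set()),
    "a": ("r", 0.8, set()),
    "b": ("a", 0.5, {"xml"}),
    "c": ("a", 0.5, {"query"}),
    "d": ("r", 0.6, {"query"}),
}


def topk_slca(doc, keywords, k):
    """Return the k nodes with the highest probability of being an SLCA."""
    keywords = set(keywords)
    nodes = list(doc)
    score = defaultdict(float)
    # Enumerate every keep/drop decision; exponential, for illustration only.
    for bits in product([True, False], repeat=len(nodes)):
        kept = dict(zip(nodes, bits))
        weight = prod(doc[n][1] if kept[n] else 1.0 - doc[n][1] for n in nodes)
        if weight == 0.0:
            continue
        # A node is in the world only if it and all its ancestors are kept.
        alive = {}
        for n in nodes:
            parent = doc[n][0]
            alive[n] = kept[n] and (parent is None or alive[parent])
        # Collect the keywords of each subtree bottom-up (children first).
        found = {n: set(doc[n][2]) for n in nodes if alive[n]}
        has_covering_child = set()
        for n in reversed(nodes):
            parent = doc[n][0]
            if alive[n] and parent is not None:
                found[parent] |= found[n]
                if keywords <= found[n]:
                    has_covering_child.add(parent)
        # SLCA: the subtree covers all keywords and no child subtree does.
        for n in found:
            if keywords <= found[n] and n not in has_covering_child:
                score[n] += weight
    return nlargest(k, score.items(), key=lambda item: item[1])
```

On this toy document with the query {"xml", "query"}, node `a` is an SLCA exactly when `a`, `b`, and `c` all exist (0.8 × 0.5 × 0.5 = 0.2), and the root `r` is an SLCA when `a`, `b`, and `d` exist but `c` does not (0.8 × 0.5 × 0.5 × 0.6 = 0.12). The exponential enumeration illustrates why possible-world semantics make the problem expensive and why single-scan and pruning-based algorithms are desirable.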
