Selectivity estimation of twig queries on cyclic graphs

Recent applications, including the Semantic Web, Web ontologies and XML, have sparked renewed interest in graph-structured databases. Twig queries are a popular tool for retrieving subgraphs from such databases, and selectivity estimation is a classical and crucial step in optimizing them. However, the majority of existing works on selectivity estimation focus on relational and tree data. In this paper, we investigate selectivity estimation of twig queries on possibly cyclic graph data. To facilitate selectivity estimation on cyclic graphs, we propose a matrix representation of graphs derived from prime labeling, a scheme for reachability queries on directed acyclic graphs. With this representation, we exploit the consecutive ones property (C1P) of matrices. As a consequence, a node is mapped to a point in a two-dimensional space, whereas a query is mapped to multiple points. We adopt histograms for scalable selectivity estimation. We perform an extensive experimental evaluation of the proposed technique and show that it keeps the estimation error below 1.3% on XMARK and DBLP, which is more accurate than previous techniques. On TREEBANK, it produces RMSE and NRMSE values 6.8 times smaller than those of previous techniques.
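
To make the prime labeling idea concrete, the following is a minimal sketch of the general technique on a directed acyclic graph, not necessarily the exact construction used in the paper. It assumes each node receives a distinct prime and that a node's label is the product of its own prime and the primes of all its ancestors, so reachability reduces to a divisibility test. The names `first_primes`, `prime_labels` and `reaches` are hypothetical and introduced only for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from math import lcm                    # Python 3.9+


def first_primes(n):
    """Return the first n primes by trial division (fine for small n)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def prime_labels(parents):
    """Label a DAG given as {node: [parent nodes]}.

    Assumption of this sketch: label(v) is the product of v's own prime
    and the primes of every ancestor of v.
    """
    # static_order() yields parents before their children.
    order = list(TopologicalSorter(parents).static_order())
    own = dict(zip(order, first_primes(len(order))))
    label = {}
    for node in order:
        acc = own[node]
        for p in parents.get(node, ()):
            # lcm merges ancestor primes without squaring shared ones.
            acc = lcm(acc, label[p])
        label[node] = acc
    return own, label


def reaches(u, v, own, label):
    """True if u is v itself or an ancestor of v."""
    return label[v] % own[u] == 0


# Diamond-shaped DAG: a -> b, a -> c, b -> d, c -> d
parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
own, label = prime_labels(parents)
reach_a_d = reaches("a", "d", own, label)
reach_b_c = reaches("b", "c", own, label)
```

Arranging the results of `reaches` for all node pairs in a 0/1 matrix gives a reachability matrix of the kind whose row and column structure the abstract refers to. Handling cycles, for example by first collapsing strongly connected components, is outside this sketch.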
