The view selection problem for XML content based routing

We consider the view selection problem for XML content based routing: given a network in which a stream of XML documents is routed, with routing decisions based on the results of evaluating XPath predicates on these documents, select a set of views that maximizes the throughput of the network. While in view selection for relational queries the speedup comes from eliminating joins, here it comes from gaining direct access to data values in an XML packet without parsing that packet. The views in our context can be seen as a binary representation of the XML document, tailored to the network's workload.

In this paper we formally define the view selection problem in the context of XML content based routing and provide a practical solution for it. First, we formalize the problem; while the exact formulation is too complex to admit practical solutions, we show that it can be simplified to a manageable optimization problem without loss of precision. Second, we show that the simplified problem can be reduced to the Integer Cover problem, which is known to be NP-hard and to admit a greedy algorithm with a log n approximation ratio. Third, we show that the same greedy algorithm performs much better on a class of workloads called 'hierarchical workloads', which are typical in XML stream processing: it returns an optimal solution for hierarchical workloads and degrades gracefully to the general log n bound as the workload becomes less hierarchical.
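To make the greedy step concrete, the following is a minimal sketch of the classical greedy heuristic on weighted set cover, a special case of covering integer programs. It is not the paper's formulation: the function name `greedy_cover`, the idea that each candidate view "covers" a set of predicates, and the cost model are all assumptions made purely for illustration.

```python
from typing import Dict, FrozenSet, Iterable, List


def greedy_cover(
    universe: Iterable[str],
    candidates: Dict[str, FrozenSet[str]],
    cost: Dict[str, float],
) -> List[str]:
    """Classical greedy heuristic for weighted set cover.

    universe   -- items that must be covered (hypothetically: XPath predicates)
    candidates -- candidate name -> items it covers (hypothetically: views)
    cost       -- candidate name -> positive cost of selecting it
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        best_name = None
        best_ratio = 0.0
        for name, items in candidates.items():
            gain = len(uncovered & items)  # newly covered items
            if gain == 0:
                continue
            ratio = gain / cost[name]  # coverage gained per unit cost
            if ratio > best_ratio:  # ties keep the earlier candidate
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("candidates cannot cover the universe")
        chosen.append(best_name)
        uncovered -= candidates[best_name]
    return chosen


if __name__ == "__main__":
    # Hypothetical workload: four predicates and four candidate views.
    predicates = {"p1", "p2", "p3", "p4"}
    views = {
        "v_a": frozenset({"p1", "p2"}),
        "v_b": frozenset({"p3"}),
        "v_c": frozenset({"p3", "p4"}),
        "v_d": frozenset({"p1", "p2", "p3", "p4"}),
    }
    costs = {"v_a": 1.0, "v_b": 1.0, "v_c": 1.0, "v_d": 3.0}
    print(greedy_cover(predicates, views, costs))
```

Each round picks the candidate with the most newly covered items per unit cost, which is the rule behind the logarithmic approximation guarantee for set cover.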
