FoXtrot: Distributed structural and value XML filtering

Publish/subscribe systems have emerged in recent years as a promising paradigm for offering popular notification services. In this context, many XML filtering systems have been proposed to efficiently identify XML data that matches user interests expressed as queries in an XML query language such as XPath. However, offering XML filtering functionality at Internet scale requires deploying such a service in a distributed environment and avoiding bottlenecks that can degrade performance. In this work, we design and implement FoXtrot, a system for filtering XML data that combines the strengths of automata for efficient filtering with those of distributed hash tables for building a fully distributed system. Apart from structural matching, which is performed using automata, we also discuss different methods for evaluating value-based predicates. We perform an extensive experimental evaluation of FoXtrot on a local cluster and on the PlanetLab network, and demonstrate that it can index millions of user queries while achieving high indexing and filtering throughput. At the same time, FoXtrot exhibits very good load-balancing properties, and its performance improves as the size of the network increases.
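To make the combination of automata and distributed hash tables more concrete, the following is a minimal sketch and not FoXtrot's actual implementation. It assumes linear XPath queries that use only the child axis and the `*` wildcard, identifies each automaton state by the path prefix that reaches it (so queries with a common prefix share states), and places each state on a peer by hashing that prefix. The names `ToyRing`, `DistributedNFA`, `index_query`, and `filter_path` are hypothetical, the "DHT" is an in-process stand-in, and descendant axes and value-based predicates are left out.

```python
import hashlib
from collections import defaultdict


class ToyRing:
    """Hypothetical in-process stand-in for a DHT: maps a key to one of n peers."""

    def __init__(self, n_peers):
        self.n_peers = n_peers

    def peer_for(self, key):
        # Hash the key and reduce it to a peer index (assumption: no real
        # overlay routing, churn handling, or replication is modelled here).
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.n_peers


class DistributedNFA:
    """Prefix-sharing automaton whose states are spread over the toy ring."""

    def __init__(self, ring):
        self.ring = ring
        # peer id -> {state key -> {element label -> next state key}}
        self.transitions = defaultdict(lambda: defaultdict(dict))
        # state key -> ids of the queries accepted in that state
        self.accepting = defaultdict(set)

    def index_query(self, query_id, xpath):
        state = ""  # the start state is keyed by the empty prefix
        for step in xpath.strip("/").split("/"):
            next_state = state + "/" + step
            peer = self.ring.peer_for(state)  # peer responsible for this state
            self.transitions[peer][state][step] = next_state
            state = next_state
        self.accepting[state].add(query_id)

    def filter_path(self, labels):
        """Return ids of queries matched along one root-to-node path of element names."""
        active = {""}
        matched = set()
        for label in labels:
            next_active = set()
            for state in active:
                peer = self.ring.peer_for(state)
                table = self.transitions[peer].get(state, {})
                # Follow the exact label and, if present, the wildcard step.
                for symbol in (label, "*"):
                    if symbol in table:
                        next_active.add(table[symbol])
            active = next_active
            for state in active:
                matched |= self.accepting.get(state, set())
        return matched


nfa = DistributedNFA(ToyRing(n_peers=4))
nfa.index_query("q1", "/book/title")
nfa.index_query("q2", "/book/*/name")
nfa.index_query("q3", "/article/title")
print(nfa.filter_path(["book", "author", "name"]))
```

In this sketch, indexing a query touches only the peers responsible for the states along its path, and filtering a document path visits only the peers holding currently active states, which is one way that both indexing and filtering load could be spread across the network.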
