Fast in-memory XPath search using compressed indexes

A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be implemented efficiently using a compressed self-index of the document's text nodes. Most queries, however, combine parts that query the text of the document with parts that query its tree structure. The challenge is therefore to choose, for a given query, an evaluation order that makes the best use of the execution speeds of the text and tree indexes. This paper introduces the SXSI system. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine based on tree automata. The engine uses fast counting queries on the text index to determine dynamically whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, SXSI performs on par with or better than the fastest known systems, MonetDB and Qizx; (2) on queries that use text search, SXSI outperforms the existing systems by 1–3 orders of magnitude (depending on the size of the result set); and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries.
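
To make the tree representation described above concrete, the following is a minimal Python sketch of storing a tree as a bit array of opening and closing brackets plus a sequence of labels. It is an illustration only and not the SXSI implementation: the names `Node`, `encode`, `find_close`, `rank1`, and the navigation helpers are hypothetical, and the linear scans used here stand in for the succinct rank and matching-bracket structures that a real system would use to answer the same questions much faster.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A plain element tree used only as input to the encoder (hypothetical)."""
    label: str
    children: list["Node"] = field(default_factory=list)


def encode(root):
    """Return (bits, labels).

    bits[i] is 1 for an opening bracket and 0 for a closing bracket;
    labels[k] is the label of the k-th opening bracket in document order.
    """
    bits, labels = [], []

    def visit(node):
        bits.append(1)
        labels.append(node.label)
        for child in node.children:
            visit(child)
        bits.append(0)

    visit(root)
    return bits, labels


def find_close(bits, i):
    """Position of the closing bracket matching the opening bracket at i (naive scan)."""
    depth = 0
    for j in range(i, len(bits)):
        depth += 1 if bits[j] else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced bracket sequence")


def rank1(bits, i):
    """Number of opening brackets in bits[0..i] (naive scan)."""
    return sum(bits[: i + 1])


def label_at(bits, labels, i):
    """Label of the node whose opening bracket is at position i."""
    return labels[rank1(bits, i) - 1]


def first_child(bits, i):
    """Opening bracket of the first child of node i, or None for a leaf."""
    return i + 1 if bits[i + 1] else None


def next_sibling(bits, i):
    """Opening bracket of the next sibling of node i, or None if there is none."""
    j = find_close(bits, i) + 1
    return j if j < len(bits) and bits[j] else None


def children_labels(bits, labels, i):
    """Labels of all children of node i, in document order."""
    out = []
    child = first_child(bits, i)
    while child is not None:
        out.append(label_at(bits, labels, child))
        child = next_sibling(bits, child)
    return out


doc = Node("site", [Node("people", [Node("person"), Node("person")]), Node("regions")])
bits, labels = encode(doc)
```

In this sketch a node is identified by the position of its opening bracket, so navigation reduces to bracket matching on `bits` and a rank lookup into `labels`; no pointers between nodes are stored.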
