Storing semi-structured data on disk drives

Applications that manage semi-structured data are becoming increasingly commonplace. Current approaches to storing semi-structured data rely on existing storage machinery: they either map the data to relational databases or use a combination of flat files and indexes. While these existing mechanisms provide readily available solutions, their suitability for this class of data needs closer examination. In particular, retrofitting existing solutions for semi-structured data can result in a mismatch between the tree structure of the data and the access characteristics of the underlying storage device (the disk drive). This study explores the design space of native storage solutions for semi-structured data, examining alternative approaches that match application data access characteristics to those of the underlying disk drive. To evaluate the effectiveness of the proposed native techniques relative to the existing solutions, we experiment with XML data using the XPathMark benchmark. Extensive evaluation reveals the strengths and weaknesses of the proposed native data layout techniques. While the existing solutions work well for deep-focused queries into a semi-structured document (those that retrieve entire subtrees), the proposed native solutions substantially outperform them on non-deep-focused queries, which we demonstrate are at least as important as deep-focused ones. We believe that native data layout techniques offer a unique direction for improving the performance of semi-structured data stores for a variety of important workloads. However, because the proposed native techniques require circumventing current storage stack abstractions, further investigation is warranted before they can be applied to general-purpose storage systems.
