Query Processing in Multistore Systems

Cloud computing is having a major impact on data management, with a proliferation of new, scalable data management solutions such as distributed file and object storage, NoSQL databases and big data processing frameworks. This also leads to a wide diversification of DBMS interfaces and the loss of a common programming paradigm, making it very hard for users to integrate their data sitting in specialized data stores, e.g. relational, document and graph data stores.

In this thesis, we address the problem of query processing with multiple cloud data stores, where the data stores have different models, languages and APIs. This thesis has been prepared in the context of the CoherentPaaS European project and, in particular, the CloudMdsQL multistore system. CloudMdsQL is a functional query language able to exploit the full power of local data stores by simply allowing native queries of a local data store to be called as functions, while still allowing the overall query to be optimized, e.g. by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping.

In this thesis, we propose an extension of CloudMdsQL that takes full advantage of the functionality of underlying data processing frameworks such as Spark by allowing the ad hoc use of user-defined map/filter/reduce (MFR) operators in combination with traditional SQL statements. This makes it possible to perform joins between relational data and big data stored in HDFS. Our solution supports optimization by enabling subquery rewriting, so that bind join can be used and filter conditions can be pushed down and applied by the data processing framework as early as possible.

We validated our solution by implementing the MFR extension as part of the CloudMdsQL query engine. Based on this prototype, we provide an experimental validation of multistore query processing in a cluster to evaluate the impact of optimization on performance. More specifically, we explore the performance benefit of using bind join and select pushdown under different conditions. Overall, our performance evaluation illustrates the CloudMdsQL query engine’s ability to optimize a query and choose the most efficient execution strategy.
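As an illustration of the two optimizations named above, the following is a minimal Python sketch of a bind join combined with select pushdown. It is not CloudMdsQL code and does not use the CloudMdsQL engine or Spark; the in-memory lists `CUSTOMERS` and `CLICKS`, the helper names `select_customers`, `fetch_clicks` and `bind_join`, and the sample rows are all hypothetical stand-ins for a relational store and an HDFS-backed data set.

```python
def bind_join(outer_rows, fetch_inner, key):
    """Join outer_rows with an inner source, shipping only the outer join keys.

    fetch_inner(bound_keys) stands in for a subquery rewritten so that the
    inner store filters on the join keys itself (hypothetical interface).
    """
    outer_rows = list(outer_rows)
    # Collect the distinct join keys produced by the outer side.
    bound_keys = {row[key] for row in outer_rows}
    # The inner side returns only rows matching those keys.
    matches = {}
    for inner in fetch_inner(bound_keys):
        matches.setdefault(inner[key], []).append(inner)
    # Combine each outer row with its matching inner rows.
    return [{**outer, **inner}
            for outer in outer_rows
            for inner in matches.get(outer[key], [])]


# Hypothetical sample data: a "relational" table and a "big data" log.
CUSTOMERS = [
    {"cust_id": 1, "name": "Ada", "country": "FR"},
    {"cust_id": 2, "name": "Bob", "country": "US"},
    {"cust_id": 3, "name": "Eve", "country": "FR"},
]
CLICKS = [
    {"cust_id": 1, "url": "/home"},
    {"cust_id": 2, "url": "/cart"},
    {"cust_id": 3, "url": "/home"},
    {"cust_id": 3, "url": "/pay"},
    {"cust_id": 4, "url": "/home"},
]


def select_customers(country):
    # Select pushdown: the predicate is evaluated at the source,
    # so only qualifying rows leave the relational store.
    return [c for c in CUSTOMERS if c["country"] == country]


def fetch_clicks(bound_keys):
    # The filter on the bound keys is applied before any row is shipped.
    return [r for r in CLICKS if r["cust_id"] in bound_keys]


result = bind_join(select_customers("FR"), fetch_clicks, "cust_id")
for row in result:
    print(row["name"], row["url"])
```

In this sketch, only the clicks of customers 1 and 3 are retrieved from the larger data set, which is the kind of reduction in intermediate data that bind join and select pushdown aim for; whether it pays off in practice depends on the selectivity of the outer side.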
