Parallel analytics as a service

Massively parallel processing relational database systems (MPPDBs) have recently gained considerable momentum in the big data analytics market. With the advent of hosted cloud computing, we envision that MPPDB-as-a-Service (MPPDBaaS) will become attractive to companies whose analytical tasks involve only hundreds of gigabytes to a few tens of terabytes of data, because they can enjoy high-end parallel analytics at low cost. This paper presents Thrifty, a prototype implementation of MPPDBaaS. The major research issue is how to lower the total cost of ownership by consolidating thousands of MPPDB tenants onto a shared hardware infrastructure, under a performance SLA that guarantees tenants obtain their query results as if they were executing their queries on dedicated machines. Thrifty achieves this goal with a tenant-driven design that includes (1) a cluster design that arranges the nodes in the cluster into groups and creates an MPPDB for each group of nodes, (2) a tenant placement that assigns each tenant to several MPPDBs (for high availability through replication), and (3) a query routing algorithm that routes a tenant's query to the proper MPPDB at run time. Experiments show that in an MPPDBaaS with 5000 tenants, where each tenant requests an MPPDB of 2 to 32 nodes to query 200 GB to 3.2 TB of data, Thrifty can serve all the tenants with a 99.9% performance SLA guarantee and a high-availability replication factor of 3, using only 18.7% of the nodes requested by the tenants.
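The three components of the tenant-driven design can be illustrated with a toy sketch. This is not Thrifty's actual algorithm: the names `MPPDB`, `Tenant`, `place_tenants`, and `route_query`, the least-loaded placement rule, and the "first idle replica" routing rule are all assumptions made for illustration. The abstract states only that nodes are grouped into MPPDBs, that each tenant is placed on several MPPDBs, and that queries are routed to a proper MPPDB at run time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class MPPDB:
    """One MPPDB instance running on a group of cluster nodes (hypothetical model)."""
    name: str
    nodes: int


@dataclass
class Tenant:
    """A tenant and the MPPDB size (in nodes) it requested (hypothetical model)."""
    name: str
    requested_nodes: int


def place_tenants(tenants: List[Tenant], mppdbs: List[MPPDB],
                  replication_factor: int = 3) -> Dict[str, List[str]]:
    """Assign each tenant to `replication_factor` distinct MPPDBs.

    Assumed rule: only MPPDBs with at least as many nodes as the tenant
    requested are eligible, and the least-loaded eligible ones are chosen.
    """
    load = {db.name: 0 for db in mppdbs}  # number of tenants placed per MPPDB
    placement: Dict[str, List[str]] = {}
    for tenant in tenants:
        eligible = [db for db in mppdbs if db.nodes >= tenant.requested_nodes]
        if len(eligible) < replication_factor:
            raise ValueError(f"not enough MPPDBs for tenant {tenant.name}")
        # Sort by current load, breaking ties by name for determinism.
        eligible.sort(key=lambda db: (load[db.name], db.name))
        chosen = eligible[:replication_factor]
        for db in chosen:
            load[db.name] += 1
        placement[tenant.name] = [db.name for db in chosen]
    return placement


def route_query(tenant_name: str, placement: Dict[str, List[str]],
                busy: Set[str]) -> Optional[str]:
    """Return the first replica MPPDB of the tenant that is not busy, else None."""
    for db_name in placement[tenant_name]:
        if db_name not in busy:
            return db_name
    return None


if __name__ == "__main__":
    dbs = [MPPDB(f"g{i}", nodes=4) for i in range(4)]
    tenants = [Tenant("a", 2), Tenant("b", 4)]
    placement = place_tenants(tenants, dbs)
    print(placement)
    print(route_query("a", placement, busy={"g0"}))
```

In this sketch, consolidation comes from several tenants sharing the same MPPDBs, while replication gives the router alternatives when one replica is occupied by another tenant's query.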
