Executing Web Application Queries on a Partitioned Database

Partitioning data over multiple storage servers is an attractive way to increase throughput for web-like workloads. However, there is often no one partitioning that yields good performance for all queries, and it can be challenging for the web developer to determine how best to execute queries over partitioned data. This paper presents DIXIE, a SQL query planner, optimizer, and executor for databases horizontally partitioned over multiple servers. DIXIE focuses on increasing interquery parallel speedup by involving as few servers as possible in each query. One way it does this is by supporting tables with multiple copies partitioned on different columns, in order to expand the set of queries that can be satisified from a single server. DIXIE automatically transforms SQL queries to execute over a partitioned database, using a cost model and plan generator that exploit multiple table copies. We evaluate DIXIE on a database and query stream taken from Wikipedia, partitioned across ten MySQL servers. By adding one copy of a 13 MB table and using DIXIE's query optimizer, we achieve a throughput improvement of 3.2X over a single optimized partitioning of each table and 8.5X over the same data on a single server. On specific queries DIXIE with table copies increases throughput linearly with the number of servers, while the best single-table-copy partitioning achieves little scaling. For a large class of joins, which traditional wisdom suggests requires tables partitioned on the join keys, DIXIE can find higher-performance plans using other partitionings.

[1]  Carlo Curino,et al.  Workload-aware database monitoring and consolidation , 2011, SIGMOD '11.

[2]  Pablo Rodriguez,et al.  The Little Engine(s) That Could: Scaling Online Social Networks , 2010, IEEE/ACM Transactions on Networking.

[3]  Vivek R. Narasayya,et al.  Integrating vertical and horizontal partitioning into automated physical database design , 2004, SIGMOD '04.

[4]  Donovan A. Schneider,et al.  The Gamma Database Machine Project , 1990, IEEE Trans. Knowl. Data Eng..

[5]  Guillaume Pierre,et al.  Wikipedia workload analysis for decentralized hosting , 2009, Comput. Networks.

[6]  J. M. Pujol,et al.  Scaling Online Social Networks without Pains , 2009 .

[7]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[8]  Tom W. Keller,et al.  Data placement in Bubba , 1988, SIGMOD '88.

[9]  Philip A. Bernstein,et al.  Adapting microsoft SQL server for cloud computing , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[10]  Carlo Curino,et al.  Schism , 2010, Proc. VLDB Endow..

[11]  Michael Stonebraker,et al.  H-store: a high-performance, distributed main memory transaction processing system , 2008, Proc. VLDB Endow..

[12]  Patricia G. Selinger,et al.  Access path selection in a relational database management system , 1979, SIGMOD '79.

[13]  Guy M. Lohman,et al.  Optimizer Validation and Performance Evaluation for Distributed Queries , 1998 .

[14]  Zachary G. Ives,et al.  Reliable storage and querying for collaborative data sharing systems , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[15]  Michael Stonebraker,et al.  The Case for Shared Nothing , 1985, IEEE Database Eng. Bull..

[16]  Stefano Ceri,et al.  Horizontal data partitioning in database design , 1982, SIGMOD '82.

[17]  Michael Stonebraker,et al.  Distributed query processing in a relational data base system , 1978, SIGMOD Conference.

[18]  Andreas Reuter,et al.  Proceedings of the 2nd International Workshop on High Performance Transaction Systems , 1987 .

[19]  Michael Stonebraker,et al.  C-Store: A Column-oriented DBMS , 2005, VLDB.

[20]  Donald Kossmann,et al.  The state of the art in distributed query processing , 2000, CSUR.