A distributed tree data structure for real-time OLAP on cloud architectures

In contrast to queries for on-line transaction processing (OLTP) systems, which typically access only a small portion of a database, OLAP queries may need to aggregate large portions of a database, which often leads to performance issues. In this paper, we introduce CR-OLAP, a cloud-based real-time OLAP system built on a new distributed index structure for OLAP, the distributed PDCR tree, which runs on a cloud infrastructure consisting of (m + 1) multi-core processors. As the database grows, CR-OLAP dynamically increases m to maintain performance. The distributed PDCR tree supports multiple dimension hierarchies and efficient query processing on the elaborate dimension hierarchies that are central to OLAP systems. It is particularly efficient for complex OLAP queries that need to aggregate large portions of the data warehouse, such as “report the total sales in all stores located in California and New York during the months February-May of all years”. We evaluated CR-OLAP on the Amazon EC2 cloud using the TPC-DS benchmark data set. The tests demonstrate that CR-OLAP scales well with an increasing number of processors, even for complex queries. For example, on an Amazon EC2 cloud instance with eight processors, for a TPC-DS OLAP query stream on a data warehouse with 80 million tuples where every OLAP query aggregates more than 50% of the database, CR-OLAP achieved a query latency of 0.3 seconds, which can be considered a real-time response.
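
To make the example query concrete, the following minimal Python sketch computes "total sales in California and New York during February-May of all years" over data split across several partitions, with each partition standing in for one worker. It is only an illustration of the query semantics: the `Sale` record, the function names, and the sample data are all assumptions, and the flat scan shown here is not the distributed PDCR tree, whose internals the abstract does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    """Hypothetical fact record with two dimensions (store location, date)."""
    state: str
    year: int
    month: int
    amount: float


def partial_total(partition, states, months):
    """Aggregate one partition; stands in for the work done by one worker."""
    return sum(
        s.amount for s in partition
        if s.state in states and s.month in months
    )


def total_sales(partitions, states, months):
    """Combine the per-partition partial sums into the final answer."""
    return sum(partial_total(p, states, months) for p in partitions)


# Made-up sample data, split across two partitions.
partitions = [
    [Sale("California", 2010, 2, 100.0), Sale("Texas", 2010, 3, 50.0)],
    [Sale("New York", 2011, 5, 25.0), Sale("California", 2011, 6, 70.0)],
]

# No constraint on the year, so all years are included.
result = total_sales(partitions, {"California", "New York"}, range(2, 6))
```

Because the year is left unconstrained, the query touches rows from every year, which is why queries of this kind can aggregate a large fraction of the warehouse.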
