A Practice of TPC-DS Multidimensional Implementation on NoSQL Database Systems

While NoSQL database systems are well established, it is not clear how to process multidimensional OLAP queries on current key-value stores. In this paper, we detail how to map the high-level cube model onto the low-level key-value stores built on NoSQL databases, and illustrate how to support OLAP queries efficiently by scaling out while retaining a MapReduce-like execution engine. For big data, the problems of storage and processing power are compounded; we balance them through partial aggregation split between batch processing and query runtime. Base cuboids are initially constructed for TPC-DS fact tables using multidimensional arrays, and cuboids for other aggregation granularities are derived from the base cuboids at runtime. The cube storage module converts dimension members into binary keys and leverages a novel distributed database to provide efficient storage for huge cuboids. The OLAP engine, built on lightweight concurrent actors, scales out seamlessly and provides highly concurrent distributed cuboid processing. Finally, we report experiments on the prototype implementation based on TPC-DS queries. The results show that multidimensional models for OLAP applications on NoSQL systems are feasible for future big data analytics.
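
The two ideas in the abstract, binary keys for dimension members and runtime derivation of coarser cuboids from a base cuboid, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three-dimension layout (date, item, store), the fixed-width big-endian key format, the function names, and the use of an in-memory dict in place of a distributed key-value store are all assumptions made for the example.

```python
import struct
from collections import defaultdict

# Assumed layout: a base cuboid over three dimensions (date, item, store),
# each member id packed as a big-endian unsigned 32-bit integer.
KEY_FORMAT = ">III"


def encode_key(date_id, item_id, store_id):
    """Pack dimension member ids into a fixed-width binary key."""
    return struct.pack(KEY_FORMAT, date_id, item_id, store_id)


def decode_key(key):
    """Recover the tuple of dimension member ids from a binary key."""
    return struct.unpack(KEY_FORMAT, key)


def build_base_cuboid(fact_rows):
    """Batch step: aggregate fact rows at the finest granularity.

    A plain dict stands in for the key-value store here.
    """
    cuboid = defaultdict(float)
    for date_id, item_id, store_id, sales in fact_rows:
        cuboid[encode_key(date_id, item_id, store_id)] += sales
    return dict(cuboid)


def roll_up(base_cuboid, keep):
    """Query-time step: derive a coarser cuboid from the base cuboid.

    `keep` lists the positions of the dimensions to retain; the measure
    is summed over all other dimensions.
    """
    result = defaultdict(float)
    for key, measure in base_cuboid.items():
        members = decode_key(key)
        result[tuple(members[i] for i in keep)] += measure
    return dict(result)


# Hypothetical fact rows: (date_id, item_id, store_id, sales).
fact_rows = [
    (1, 10, 100, 5.0),
    (1, 10, 100, 2.5),
    (1, 11, 100, 4.0),
    (2, 10, 101, 1.0),
]

base = build_base_cuboid(fact_rows)
by_date = roll_up(base, keep=(0,))
by_date_store = roll_up(base, keep=(0, 2))
```

Fixed-width big-endian packing keeps keys byte-comparable in the same order as the member ids, which is one common reason to choose such an encoding for range scans; whether the paper's storage module relies on this property is not stated in the abstract.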
