Building cubes with MapReduce

In recent years, the problems of using generic storage techniques for very specific applications have been identified and outlined. As a result, alternatives to relational DBMSs (e.g., BigTable) are flourishing. At the same time, cloud computing is already a reality that saves money by replacing fixed hardware and software costs with a pay-per-use model, and specific software tools to exploit a cloud are available as well. The trend here is toward tools based on the MapReduce paradigm developed by Google. In this paper, we explore the possibility of keeping data in a cloud, using BigTable to store the corporate historical data and MapReduce as an agile mechanism to deploy cubes in ad-hoc Data Marts. Our main contribution is a comparison of three different approaches to retrieving data cubes from BigTable by means of MapReduce, together with the definition of criteria for choosing among them.
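To make the idea of deploying a cube with MapReduce concrete, the following is a minimal, single-process sketch of one naive way to do it. It is not one of the three approaches compared in the paper, which the abstract does not describe. The dimension names, the `sales` measure, and the helper functions `map_row`, `reduce_pairs`, and `build_cube` are all hypothetical, and in-memory Python dictionaries stand in for rows read from BigTable.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and measure; not taken from the paper.
DIMENSIONS = ("region", "product", "year")
MEASURE = "sales"


def map_row(row):
    """Map phase: emit one (group-by key, measure) pair per cuboid.

    Every subset of the dimensions is one cuboid, so each input row
    produces 2 ** len(DIMENSIONS) pairs.
    """
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            key = tuple((d, row[d]) for d in dims)
            yield key, row[MEASURE]


def reduce_pairs(pairs):
    """Reduce phase: sum the measure for each distinct group-by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def build_cube(rows):
    """Run the map and reduce phases sequentially in one process."""
    pairs = (pair for row in rows for pair in map_row(row))
    return reduce_pairs(pairs)


# Toy fact rows standing in for data stored in BigTable.
rows = [
    {"region": "EU", "product": "A", "year": 2010, "sales": 10},
    {"region": "EU", "product": "B", "year": 2010, "sales": 5},
    {"region": "US", "product": "A", "year": 2011, "sales": 7},
]
cube = build_cube(rows)
```

Here `cube[()]` holds the grand total and `cube[(("region", "EU"),)]` holds the total for one region. Emitting every cuboid from every row is simple but multiplies the intermediate data by the number of cuboids, which is the kind of trade-off that any comparison of cube-building strategies has to weigh.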
