Leveraging MapReduce with Column-Oriented Stores: Study of Solutions and Benefits

The MapReduce framework is a powerful tool for processing large volumes of data. It is becoming ubiquitous and is generally used with column-oriented stores. It offers high scalability and fault tolerance for large-scale data processing, but accessing data in columnar stores still raises several issues. In this paper, we first compare column stores with row stores in terms of how data is stored and accessed. We then study the main challenges that arise when column stores are used with MapReduce, such as data co-location, distribution, serialization, and compression, and discuss effective solutions to these challenges.
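The difference between row and column layouts in storing and accessing data can be illustrated with a minimal sketch. The table, the field names, and the helper functions `to_row_layout` and `to_column_layout` are hypothetical and chosen only for illustration. They are not taken from any system surveyed here, and in-memory Python lists stand in for on-disk blocks.

```python
# Hypothetical three-record table used only for illustration.
records = [
    {"id": 1, "name": "a", "price": 10},
    {"id": 2, "name": "b", "price": 20},
    {"id": 3, "name": "c", "price": 30},
]


def to_row_layout(rows):
    """Row store: all fields of one record are kept together."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Column store: all values of one attribute are kept together."""
    return {key: [r[key] for r in rows] for key in rows[0]}


row_store = to_row_layout(records)
column_store = to_column_layout(records)

# Aggregating a single attribute:
# the row layout walks every record and picks out one field,
# while the column layout reads only the list for that attribute.
total_from_rows = sum(row[2] for row in row_store)
total_from_columns = sum(column_store["price"])
```

Both totals are equal. The point of the sketch is the access pattern: a query that needs one attribute reads a single contiguous list in the column layout, whereas the row layout has to touch every record.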
