Benchmarking SQL on MapReduce systems using large astronomy databases

In the era of bigdata, with a massive set of digital information of unprecedented volumes being collected and/or produced in several application domains, it becomes more and more difficult to manage and query large data repositories. In the framework of the PetaSky project (http://com.isima.fr/Petasky), we focus on the problem of managing scientific data in the field of cosmology. The data we consider are those of the LSST project (http://www.lsst.org/). The overall size of the database that will be produced is expected to exceed 60 PB (Lsst data challenge handbook, 2012). In order to evaluate the performances of existing SQL On MapReduce data management systems, we conducted extensive experiments by using data and queries from the area of cosmology. The goal of this work is to report on the ability of such systems to support large scale declarative queries. We mainly investigated the impact of data partitioning, indexing and compression on query execution performances.

[1]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[2]  Zheng Shao,et al.  Hive - a petabyte scale data warehouse using Hadoop , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[3]  Mohand-Said Hacid,et al.  A test framework for large scale declarative queries: preliminary results , 2014, SAC.

[4]  Shengsheng Huang,et al.  HiBench : A Representative and Comprehensive Hadoop Benchmark Suite , 2012 .

[5]  Ralf Lämmel,et al.  Google's MapReduce programming model - Revisited , 2007, Sci. Comput. Program..

[6]  Nrusimham Ammu,et al.  Big Data Challenges , 2013 .

[7]  Abraham Silberschatz,et al.  HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads , 2009, Proc. VLDB Endow..

[8]  Beng Chin Ooi,et al.  Efficient Processing of k Nearest Neighbor Joins using MapReduce , 2012, Proc. VLDB Endow..

[9]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[10]  Mirek Riedewald,et al.  Processing theta-joins using MapReduce , 2011, SIGMOD '11.

[11]  Younghoon Kim,et al.  Parallel Top-K Similarity Join Algorithms Using MapReduce , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[12]  Scott Shenker,et al.  Shark: SQL and rich analytics at scale , 2012, SIGMOD '13.

[13]  Abraham Silberschatz,et al.  Efficient processing of data warehousing queries in a split execution environment , 2011, SIGMOD '11.

[14]  Zhiwei Xu,et al.  RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[15]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  Raghu Ramakrishnan,et al.  Sailfish: a framework for large scale data processing , 2012, SoCC '12.

[17]  Michael Stonebraker,et al.  The End of an Architectural Era (It's Time for a Complete Rewrite) , 2007, VLDB.

[18]  Mohand-Said Hacid,et al.  A Comparison of Systems to Large-Scale Data Access , 2014, DASFAA Workshops.

[19]  Jeffrey D. Ullman,et al.  Optimizing Multiway Joins in a Map-Reduce Environment , 2011, IEEE Transactions on Knowledge and Data Engineering.

[20]  Michael Stonebraker,et al.  A comparison of approaches to large-scale data analysis , 2009, SIGMOD Conference.

[21]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[22]  Christos Faloutsos,et al.  V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors , 2012, Proc. VLDB Endow..

[23]  Yuqing Zhu,et al.  BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework , 2014, DASFAA.

[24]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[25]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[26]  Yasin N. Silva,et al.  Exploiting MapReduce-based similarity joins , 2012, SIGMOD Conference.

[27]  Jeffrey D. Ullman,et al.  Optimizing joins in a map-reduce environment , 2010, EDBT '10.

[28]  Amin Vahdat,et al.  Themis: an I/O-efficient MapReduce , 2012, SoCC '12.