SciHadoop: Array-based query processing in Hadoop

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when it is used to process scientific data, which is commonly stored in highly structured, array-based binary file formats, and this limits the scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin that allows scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations for several representative aggregate queries. The optimizations address three goals: reducing total data transfers, reducing remote reads, and reducing unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
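
The partition-pruning idea mentioned above can be illustrated with a minimal sketch. It is not SciHadoop's implementation: the `Box` type, the `intersects` and `prune_partitions` helpers, and the example array shapes are all hypothetical, and the sketch assumes that both the query and each input partition can be described as a hyper-rectangle in logical array coordinates.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical hyper-rectangle in logical array coordinates."""
    corner: Tuple[int, ...]  # inclusive lower corner, one entry per dimension
    shape: Tuple[int, ...]   # extent along each dimension


def intersects(a: Box, b: Box) -> bool:
    """Return True if the two boxes overlap in every dimension."""
    return all(
        a_lo < b_lo + b_len and b_lo < a_lo + a_len
        for a_lo, a_len, b_lo, b_len in zip(a.corner, a.shape, b.corner, b.shape)
    )


def prune_partitions(query: Box, partitions: List[Box]) -> List[Box]:
    """Keep only the partitions whose cells the query can depend on."""
    return [p for p in partitions if intersects(query, p)]


# Assumed example: a (time, lat, lon) array split into four partitions of
# ten time steps each, and a query that touches time steps 12 through 16.
partitions = [Box((t, 0, 0), (10, 180, 360)) for t in range(0, 40, 10)]
query = Box((12, 0, 0), (5, 90, 90))
selected = prune_partitions(query, partitions)
```

In this example only the partition starting at time step 10 overlaps the query, so the other three would never need to be scanned.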
