Haggis: turbocharge a MapReduce based spatial data warehousing system with GPU engine

Spatial query processing involves complex multidimensional objects and compute intensive spatial operations, and therefore requires a high performance approach to meet the rapid data analytics requirements of modern spatial applications. Recently, MapReduce based spatial query systems have become a viable solution for many data intensive query tasks and have gained widespread adoption in both academia and industry. At the same time, GPUs have been successfully utilized in many applications that require high performance computation. Both approaches, GPU and MapReduce, have their own limitations and advantages, and have been separately utilized in spatial query processing tasks to boost application performance. However, it is unclear how MapReduce and GPU, two vastly different parallelization techniques, can be fused together to effectively deal with spatial big data challenges. In this paper, we explore the synergy of these parallelization techniques for large scale spatial query processing. We extend Hadoop-GIS, a MapReduce based spatial query system, and provide GPU accelerated spatial query processing capability at the engine level. We evaluate the system on a real world dataset, and demonstrate that the GPU accelerated system can gain considerable performance improvements. We also show how other factors, such as partition granularity and task scheduling between CPU and GPU, can impact query performance.
