Large-scale spatial join query processing in Cloud

The rapidly increasing amount of location data available in many applications has made it desirable to process large-scale spatial queries in the Cloud for performance and scalability. We report the design and implementation of two prototype systems that are ready for Cloud deployment: SpatialSpark, based on Apache Spark, and ISP-MC, based on Cloudera Impala. Both systems support indexed spatial joins based on point-in-polygon tests and point-to-polyline distance computation. Experiments on the pickup locations of ~170 million taxi trips in New York City and ~10 million global species occurrence records demonstrate both efficiency and scalability on Amazon EC2 clusters.
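To make the two join predicates concrete, the following is a minimal single-machine sketch in Python. It is not the SpatialSpark or ISP-MC implementation: the function names are hypothetical, coordinates are treated as planar, and a simple bounding-box filter stands in for the spatial index the abstract mentions. It only illustrates the filter-and-refine pattern with a point-in-polygon test and a point-to-polyline distance.

```python
from math import hypot, inf


def point_in_polygon(x, y, ring):
    """Ray-casting test. `ring` is a list of (x, y) vertices, implicitly closed."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_to_polyline_distance(x, y, line):
    """Smallest planar distance from (x, y) to any segment of `line`."""
    best = inf
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        dx, dy = x2 - x1, y2 - y1
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0:
            t = 0.0  # degenerate segment: both endpoints coincide
        else:
            # Project the point onto the segment and clamp to its endpoints.
            t = max(0.0, min(1.0, ((x - x1) * dx + (y - y1) * dy) / seg_len2))
        best = min(best, hypot(x - (x1 + t * dx), y - (y1 + t * dy)))
    return best


def bbox(coords):
    """Axis-aligned bounding box as (min_x, min_y, max_x, max_y)."""
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    return min(xs), min(ys), max(xs), max(ys)


def join_points_to_polygons(points, polygons):
    """Filter by bounding box (assumed stand-in for an index), then refine exactly.

    `points` maps point id -> (x, y); `polygons` maps polygon id -> ring.
    Returns a list of (point_id, polygon_id) pairs.
    """
    boxes = {pid: bbox(ring) for pid, ring in polygons.items()}
    pairs = []
    for pt_id, (x, y) in points.items():
        for pid, (min_x, min_y, max_x, max_y) in boxes.items():
            if min_x <= x <= max_x and min_y <= y <= max_y:
                if point_in_polygon(x, y, polygons[pid]):
                    pairs.append((pt_id, pid))
    return pairs
```

For example, with a square polygon `[(0, 0), (4, 0), (4, 4), (0, 4)]` and points at `(2, 2)` and `(5, 5)`, only the first point joins to the square. A distributed system would typically broadcast or partition the polygon side and run the refinement step in parallel over the point side; that part is outside the scope of this sketch.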
