Scalable computational geometry in MapReduce

Hadoop, which employs the MapReduce programming paradigm, has been widely accepted as the standard framework for analyzing big data in distributed environments. Unfortunately, this rich framework has not been exploited for large-scale computational geometry operations. This paper introduces CG_Hadoop, a suite of scalable and efficient MapReduce algorithms for fundamental computational geometry operations, namely polygon union, Voronoi diagram, skyline, convex hull, farthest pair, and closest pair, which serve as key building blocks for other geometric algorithms. For each operation, CG_Hadoop has two versions: one for the Apache Hadoop system and one for SpatialHadoop, a Hadoop-based system that is better suited to spatial operations. The proposed algorithms form the nucleus of a comprehensive MapReduce library of computational geometry operations. Extensive experiments on a cluster of 25 machines, over datasets of up to 3.8 billion records, show that CG_Hadoop achieves up to 14x and 115x better performance than traditional algorithms when using Hadoop and SpatialHadoop, respectively.
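As a rough illustration of how an operation such as skyline can be expressed in the MapReduce style, the sketch below computes a local skyline per partition (the map step) and then merges the local results (the reduce step). It is a single-process simulation under stated assumptions, not CG_Hadoop's implementation: the function names, the round-robin partitioning, and the "larger is better in both dimensions" dominance rule are all hypothetical choices made for the example.

```python
from functools import reduce
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def skyline(points: Iterable[Point]) -> List[Point]:
    """Return the points not dominated by any other point.

    Assumption: a point dominates another if it is at least as large in
    both coordinates and differs from it (max-max dominance).
    """
    result: List[Point] = []
    best_y = float("-inf")
    # Sweep from largest x to smallest; a point survives only if its y is
    # strictly higher than every y seen so far.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result


def mapreduce_skyline(points: Iterable[Point], num_partitions: int = 4) -> List[Point]:
    """Simulate a map step (local skylines) and a reduce step (merge)."""
    pts = list(points)
    # Hypothetical partitioning: round-robin split into num_partitions chunks.
    partitions = [pts[i::num_partitions] for i in range(num_partitions)]
    # Map: each partition independently computes its local skyline.
    local_skylines = [skyline(part) for part in partitions]
    # Reduce: merge local skylines pairwise by re-running skyline on the union.
    return reduce(lambda a, b: skyline(a + b), local_skylines, [])


if __name__ == "__main__":
    data = [(1, 5), (2, 4), (3, 3), (4, 1), (2, 2), (1, 1), (3, 1), (0, 0)]
    print(mapreduce_skyline(data, num_partitions=3))
```

The merge is correct because any point on the global skyline is also on the skyline of whichever partition holds it, so discarding locally dominated points early never loses a final answer. A spatially aware partitioning would let whole partitions be pruned before the merge, but that refinement is not shown here.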
