CG_Hadoop: computational geometry in MapReduce

Hadoop, employing the MapReduce programming paradigm, has been widely accepted as the standard framework for analyzing big data in distributed environments. Unfortunately, this rich framework has not been fully exploited for processing large-scale computational geometry operations. This paper introduces CG_Hadoop, a suite of scalable and efficient MapReduce algorithms for several fundamental computational geometry problems, namely polygon union, skyline, convex hull, farthest pair, and closest pair, which are key components of other geometric algorithms. For each computational geometry operation, CG_Hadoop has two versions: one for the Apache Hadoop system and one for SpatialHadoop, a Hadoop-based system that is better suited to spatial operations. The proposed algorithms form the nucleus of a comprehensive MapReduce library of computational geometry operations. Extensive experiments on a cluster of 25 machines with datasets of up to 128GB show that CG_Hadoop achieves up to 29x and 260x better performance than traditional algorithms when using the Hadoop and SpatialHadoop systems, respectively.
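
To illustrate the general idea of running a geometric operation in a MapReduce style, the following is a minimal single-process sketch in Python for the convex hull problem. It is not the CG_Hadoop implementation: the function names (`partition`, `map_phase`, `reduce_phase`, `distributed_hull`) and the round-robin partitioning are assumptions made only for this illustration. Each partition computes a local hull (the map step), and a single final step computes the hull of the surviving points (the reduce step). This works because any point on the global hull must also lie on the hull of its own partition.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Monotone-chain convex hull, counter-clockwise, no collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def partition(points: List[Point], n: int) -> List[List[Point]]:
    """Hypothetical round-robin split standing in for file blocks."""
    return [points[i::n] for i in range(n)]


def map_phase(partitions: List[List[Point]]) -> List[List[Point]]:
    """Map step: each partition independently computes its local hull."""
    return [convex_hull(part) for part in partitions]


def reduce_phase(local_hulls: List[List[Point]]) -> List[Point]:
    """Reduce step: hull of the union of all local hull points."""
    merged = [p for hull in local_hulls for p in hull]
    return convex_hull(merged)


def distributed_hull(points: List[Point], n: int) -> List[Point]:
    """Run the map and reduce steps in sequence over n partitions."""
    return reduce_phase(map_phase(partition(points, n)))
```

In a real cluster the map step would run in parallel on separate machines; here the steps run sequentially in one process only to show the data flow.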
