Kangaroo: Workload-Aware Processing of Range Data and Range Queries in Hadoop

Despite the importance and widespread use of range data, e.g., time intervals and spatial ranges, little attention has been devoted to studying the processing and querying of range data in the context of big data. The main challenge lies in the nature of traditional index structures, e.g., the B-Tree and R-Tree, which are centralized by design and hence are almost crippled when deployed in a distributed environment. To address this challenge, this paper presents Kangaroo, a system built on top of Hadoop to optimize the execution of range queries over range data. The main idea behind Kangaroo is to split the data into non-overlapping partitions in a way that minimizes the query execution time. Kangaroo is query workload-aware, i.e., it produces partitioning layouts that minimize the query processing time of given query patterns. In this paper, we study the design challenges Kangaroo addresses in order to be deployed on top of a distributed file system, i.e., HDFS. We also study four different partitioning schemes that Kangaroo can support. With extensive experiments using real range data of more than one billion records and a real query workload of more than 30,000 queries, we show that the partitioning schemes of Kangaroo can significantly reduce the I/O of range queries on range data.
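
To illustrate why non-overlapping partitions can reduce the I/O of a range query, the following is a minimal sketch and not one of Kangaroo's partitioning schemes, which the abstract does not specify. It assumes each record is an inclusive interval (start, end), treats it as a point in the (start, end) plane, and places it in a fixed grid cell; the cut points and the function names `cell_of`, `cell_bounds`, `build_partitions`, and `range_query` are hypothetical. A query [qs, qe] then reads only the cells whose bounds could contain an overlapping interval.

```python
import bisect
import math
from collections import defaultdict


def cell_of(value, cuts):
    """Index of the grid cell that holds `value`, given sorted cut points."""
    return bisect.bisect_right(cuts, value)


def cell_bounds(i, cuts):
    """Half-open bounds [lo, hi) covered by cell i."""
    lo = cuts[i - 1] if i > 0 else -math.inf
    hi = cuts[i] if i < len(cuts) else math.inf
    return lo, hi


def build_partitions(records, start_cuts, end_cuts):
    """Assign each (start, end) record to exactly one grid cell (hypothetical layout)."""
    parts = defaultdict(list)
    for s, e in records:
        parts[(cell_of(s, start_cuts), cell_of(e, end_cuts))].append((s, e))
    return dict(parts)


def range_query(parts, start_cuts, end_cuts, qs, qe):
    """Return records overlapping [qs, qe] and the number of partitions read."""
    hits, scanned = [], 0
    for (i, j), recs in parts.items():
        s_lo, _ = cell_bounds(i, start_cuts)
        _, e_hi = cell_bounds(j, end_cuts)
        # A record overlaps the query iff start <= qe and end >= qs.
        # Skip a cell if every start in it exceeds qe, or every end is below qs.
        if s_lo > qe or e_hi <= qs:
            continue
        scanned += 1
        hits.extend(r for r in recs if r[0] <= qe and r[1] >= qs)
    return sorted(hits), scanned


records = [(1, 3), (2, 10), (5, 6), (8, 12), (11, 15), (14, 20)]
start_cuts, end_cuts = [5, 10], [7, 13]  # illustrative cut points
parts = build_partitions(records, start_cuts, end_cuts)
hits, scanned = range_query(parts, start_cuts, end_cuts, 9, 11)
print(hits, f"read {scanned} of {len(parts)} partitions")
```

In this toy layout the cut points are fixed by hand. A workload-aware system in the sense of the abstract would instead choose the layout so that the queries it expects to see touch as few partitions as possible.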
