Transactions on Large-Scale Data- and Knowledge-Centered Systems XXVIII

We propose a scheme for efficient set similarity joins on Graphics Processing Units (GPUs). With the rapid growth and diversification of data, there is increasing demand for fast execution of set similarity joins in applications ranging from data integration to plagiarism detection. To address this demand, our solution takes advantage of the massively parallel processing offered by GPUs. In addition, we employ MinHash to estimate the Jaccard similarity between two sets. By combining the high parallelism of GPUs with the space efficiency of MinHash, we achieve high performance without sacrificing accuracy. Experimental results show that the proposed method is more than two orders of magnitude faster than a serial CPU implementation and 25 times faster than a parallel CPU implementation, while generating highly precise query results.
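To illustrate how MinHash estimates Jaccard similarity, the following is a minimal CPU-side sketch in Python. It is not the GPU implementation described above. The function names, the choice of salted BLAKE2b as the hash family, and the signature length of 256 are all assumptions made for illustration. Each set is reduced to a fixed-length signature holding the minimum hash value under each hash function, and the fraction of positions at which two signatures agree approximates the Jaccard similarity of the original sets.

```python
import hashlib


def _hash(token: str, index: int) -> int:
    """Hypothetical hash family: BLAKE2b salted with the hash-function index."""
    salt = index.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set, num_hashes: int = 256) -> list:
    """Return the MinHash signature of a non-empty set of string tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    # For each hash function, keep only the smallest hash value over the set.
    return [min(_hash(t, i) for t in tokens) for i in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A intersect B| / |A union B|, for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = {f"tok{i}" for i in range(0, 100)}
    b = {f"tok{i}" for i in range(50, 150)}
    estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"exact={exact_jaccard(a, b):.3f} estimate={estimate:.3f}")
```

Because every signature has the same fixed length regardless of set size, comparing two sets costs a constant number of equality checks. Each slot can also be computed and compared independently of the others, which is the kind of regular, data-parallel work that maps naturally onto many GPU threads.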
