Supporting Similarity Queries in Apache AsterixDB

Many applications require similarity query processing. Most existing work has taken an algorithmic approach, developing index structures, algorithms, and/or various optimizations. In this work, we take a different, systems-oriented approach. We describe the support for similarity queries in Apache AsterixDB, a parallel, open-source Big Data management system for NoSQL data. We follow the lifecycle of a similarity query in the system, including the support provided at the query language level, indexing, execution plans (with and without indexes), and plan rewrites that optimize query execution. Our approach leverages the existing infrastructure of AsterixDB, including its operators, parallel query engine, and rule-based query optimizer. We have conducted an experimental study using several large, real data sets on a parallel computing cluster to evaluate AsterixDB’s support for similarity queries, and we report the efficacy and performance results here.
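To make the notion of a similarity query concrete, the following minimal sketch shows a similarity selection and a similarity join over strings. It is an illustration only: the abstract does not name a similarity measure, so Jaccard similarity over word tokens is assumed here, and the function names (`word_tokens`, `jaccard`, `similarity_select`, `similarity_join`) are hypothetical and do not reflect AsterixDB's query language or implementation. The join is a naive nested loop, with none of the indexing or plan rewrites the abstract mentions.

```python
def word_tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


def similarity_select(records, query, threshold):
    """Return the records whose similarity to the query meets the threshold."""
    q = word_tokens(query)
    return [r for r in records if jaccard(word_tokens(r), q) >= threshold]


def similarity_join(left, right, threshold):
    """Naive nested-loop similarity join: all (l, r) pairs meeting the threshold."""
    return [
        (l, r)
        for l in left
        for r in right
        if jaccard(word_tokens(l), word_tokens(r)) >= threshold
    ]
```

A scan-based plan like this compares every candidate pair, which is the cost that index-based execution plans aim to avoid.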
