The similarity-aware relational division database operator with case studies in agriculture and genetics

Abstract: In relational algebra, the division operator (÷) is an intuitive tool for writing queries that express the concept of "for all", and it is therefore frequently required in real applications. However, as we demonstrate here, division does not support many needs common to modern applications, particularly those that involve complex data analysis, such as processing images, audio, genetic data, large graphs, fingerprints, and many other "non-traditional" data types. The main issue is that the operator intrinsically compares attribute values, and these comparisons are, by definition, always performed by identity (=), even though complex data must be compared by similarity. Recent works focus on supporting similarity comparison in relational operators, but none of them addresses division. This paper presents the new Similarity-aware Division (÷̂) operator. Our novel operator is naturally well suited to answering queries built on the idea of "candidate elements and exigencies" over complex data from modern applications. For example, it is potentially useful to support agriculture, genetic analyses, digital library search, and prospective client identification, and even to help control the quality of manufactured products in industry. We validate our proposals by studying the first two of these applications.
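To make the contrast between identity and similarity comparison concrete, the following is a minimal sketch in Python. It is not the operator as formally defined in the paper: the function names `divide` and `similarity_divide`, the `distance` and `threshold` parameters, and the sample data are all assumptions chosen for illustration. The sketch assumes that a candidate satisfies an exigency when at least one of its values lies within `threshold` of the required value.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple


def divide(dividend: Iterable[Tuple[Hashable, Hashable]],
           divisor: Iterable[Hashable]) -> Set[Hashable]:
    """Classic division: keep the keys paired with EVERY divisor value,
    where values are compared by identity (=)."""
    required = set(divisor)
    groups = {}
    for key, value in dividend:
        groups.setdefault(key, set()).add(value)
    return {key for key, values in groups.items() if required <= values}


def similarity_divide(dividend: Iterable[Tuple[Hashable, float]],
                      divisor: Iterable[float],
                      distance: Callable[[float, float], float],
                      threshold: float) -> Set[Hashable]:
    """Illustrative similarity-aware variant (an assumption, not the
    paper's definition): a key qualifies when, for every required value,
    at least one of its own values is within `threshold` of it."""
    required = list(divisor)
    groups = {}
    for key, value in dividend:
        groups.setdefault(key, []).append(value)
    return {
        key
        for key, values in groups.items()
        if all(any(distance(v, r) <= threshold for v in values)
               for r in required)
    }


# Hypothetical data: (candidate, measurement) pairs and required readings.
measurements = [
    ("A", 6.02), ("A", 7.48),   # close to both requirements, equal to neither
    ("B", 6.0), ("B", 9.0),     # matches one requirement only
    ("C", 6.0), ("C", 7.5),     # matches both requirements exactly
]
requirements = [6.0, 7.5]

exact = divide(measurements, requirements)
similar = similarity_divide(measurements, requirements,
                            distance=lambda a, b: abs(a - b),
                            threshold=0.05)
```

Under these assumptions, exact division keeps only candidate `C`, whereas the similarity-aware variant also keeps candidate `A`, whose measurements are near the required values without being identical to them.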
