Detecting correlated columns in relational databases with mixed data types

In a database, besides known dependencies among columns (e.g., foreign key and primary key constraints), there are many other correlations unknown to the database users. Extracting such hidden correlations is known to be useful for various tasks in database optimization and data analytics. However, the task is challenging for three reasons. First, suitable measures for quantifying column correlations are lacking: correlations may exist among columns of different data types and value domains, which makes techniques based on value matching inapplicable. Second, a column may have multiple semantics, which rules out a disjoint partitioning of columns. Third, from a computational perspective, one has to consider a huge search space that grows exponentially with the number of columns. In this paper, we present a novel method for detecting column correlations (DeCoRel). It aims at discovering overlapping groups of correlated columns with mixed data types in relational databases. To handle the heterogeneity of data types, we propose a new correlation measure that combines the strengths of Shannon entropy and cumulative entropy. To address the huge search space, we introduce an efficient algorithm for column grouping. We show that our method is more general than one of the most recent approaches in the database literature. Experiments reveal that our method achieves both higher quality and better scalability than existing techniques.
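As a rough illustration of the two entropy notions named in the abstract, the sketch below computes Shannon entropy for a categorical column and a common empirical estimate of cumulative entropy for a numeric column. It is not the DeCoRel measure itself: the abstract does not state how the two are combined, so that step is left out. The function names `shannon_entropy` and `cumulative_entropy` are hypothetical, and the CDF-based estimator with base-2 logarithms is an assumption that may differ from the paper's definition.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a categorical column, from value frequencies."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cumulative_entropy(values):
    """Empirical cumulative entropy of a numeric column (assumed estimator).

    Approximates -integral of P(X <= x) * log2 P(X <= x) dx by using the
    empirical CDF, which equals i/n on the gap between the i-th and
    (i+1)-th smallest values.
    """
    xs = sorted(values)
    n = len(xs)
    total = 0.0
    for i in range(1, n):
        p = i / n                 # empirical CDF on the gap [xs[i-1], xs[i])
        gap = xs[i] - xs[i - 1]   # width of that gap
        total += gap * p * math.log2(p)
    return -total


if __name__ == "__main__":
    # Toy columns, invented for illustration only.
    category = ["a", "a", "b", "b"]
    amount = [0.0, 1.0]
    print(shannon_entropy(category))
    print(cumulative_entropy(amount))
```

Cumulative entropy works directly on the sorted numeric values, so no discretization is needed, while Shannon entropy handles categorical values. This is one plausible reason to pair the two when columns have mixed data types.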
