A Matrix Iteration Algorithm With Pruning for Pinpointing Multivariate Correlations From High Dimensional Data Sets

High dimensional data sets contain a few dependent multivariate relationships, and identifying the dependent variables involved is an important problem in data analysis. The most frequently used approach is the enumeration method, in which every multivariate relationship in the data set is examined. However, the time complexity of the enumeration method is exponential ($2^{n}$), and the computational load becomes very heavy when the dimension is high. To address this problem, a matrix iteration algorithm with pruning (MIP) is proposed for pinpointing multivariate dependent relationships in high dimensional data sets without examining all multivariate relationships. The pruning process of MIP discards some non-dependent relationships without examining them, which reduces the computational burden. The maximal information coefficient (MIC) is adopted as the measure of correlation in MIP because of its excellent properties of generality and equitability. For a data set with 5 variables, more than 50% of the multivariate relationships are pruned without being examined. Numerical experiments also show that the computational burden is greatly reduced: compared with the enumeration method, MIP saves 82.5% of the computing time and 98.5% of the evaluations of multivariate relationships on a data set with two dependent multivariate relationships among 30 variables. The proposed MIP algorithm is effective for pinpointing multivariate dependent relationships in high dimensional data sets.
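To make the exponential cost of the enumeration method concrete, the following minimal Python sketch counts the candidate relationships a brute-force search would have to examine. It assumes, for illustration only, that a candidate multivariate relationship is any subset of at least two variables; the function names `count_candidates` and `enumerate_candidates` are hypothetical and are not taken from the paper. It illustrates the enumeration baseline, not the MIP algorithm itself.

```python
from itertools import combinations
from math import comb


def count_candidates(n, min_size=2):
    """Number of variable subsets with at least `min_size` members.

    Assumption: each such subset is one candidate relationship that
    the enumeration method would have to score.
    """
    return sum(comb(n, k) for k in range(min_size, n + 1))


def enumerate_candidates(variables, min_size=2):
    """Yield every candidate subset, smallest subsets first."""
    for k in range(min_size, len(variables) + 1):
        yield from combinations(variables, k)


if __name__ == "__main__":
    for n in (5, 10, 20, 30):
        print(n, count_candidates(n))
```

Under this assumption the count is $2^{n} - n - 1$, which grows on the order of $2^{n}$ and is why discarding candidates without scoring them matters at high dimension.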
