A local dependence measure and its application to screening for high correlations in large data sets

Correlation screening is frequently the only practical way to discover dependencies in very high dimensional data. In correlation screening, a high threshold is applied to the matrix of sample correlation coefficients of the multivariate data. Variables whose coefficients exceed the threshold are called discoveries and are classified as dependent. The mean number of discoveries and the number of false discoveries in correlation screening problems depend on an information-theoretic measure J, a novel type of information divergence that is a function of the joint density of pairs of variables. It is therefore important to estimate J in order to determine screening thresholds for desired false alarm rates. In this paper, we propose a kernel estimator for J, establish its asymptotic consistency, and determine its asymptotic distribution. These results are used to minimize the MSE of the estimator and to determine confidence intervals on J. We use these results to test for dependence between variables in simulated data sets and between email spam harvesters. Finally, we use the estimate of J to determine screening thresholds in correlation screening problems involving gene expression data.
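
The screening step described above can be illustrated with a minimal Python sketch. It covers only the thresholding of the sample correlation matrix, not the kernel estimator for J proposed in the paper. The function name `screen_correlations`, the use of absolute values of the coefficients, the threshold 0.8, and the synthetic data are all illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def screen_correlations(X, rho):
    """Return index pairs (i, j), i < j, whose |sample correlation| exceeds rho.

    X is an (n_samples, n_variables) array; rho is the screening threshold.
    Thresholding the absolute value is an assumption made for this sketch.
    """
    R = np.corrcoef(X, rowvar=False)       # p x p sample correlation matrix
    i, j = np.triu_indices_from(R, k=1)    # strict upper triangle: each pair once
    hits = np.abs(R[i, j]) > rho           # pairs exceeding the threshold
    return list(zip(i[hits].tolist(), j[hits].tolist()))


# Synthetic example: 50 independent Gaussian variables, 200 samples each.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))

# Plant one strongly dependent pair (variables 0 and 1).
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)

discoveries = screen_correlations(X, rho=0.8)
print(discoveries)
```

With a threshold this high and this many samples, the planted pair should be the only discovery. Lowering `rho` admits more pairs, including false discoveries among the independent variables, which is the trade-off that a threshold chosen from an estimate of J is meant to control.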