A method for combining mutual information and canonical correlation analysis: Predictive Mutual Information and its use in feature selection

Highlights

• We propose a hybrid measure of relevance based on MI and KCCA.
• Our measure, PMI, gives greater weight to samples with predictive power.
• PMI effectively eliminates samples with no predictive contribution.
• We show that PMI has improved feature detection capability.

Feature selection is a critical step in many artificial intelligence and pattern recognition problems. Shannon's Mutual Information (MI) is a classical and widely used measure of dependence that serves well as a feature selection criterion. However, because it measures dependence on average, under-sampled classes (rare events) can be overlooked, which can cause critical false negatives (missing a relevant feature that is highly predictive of some rare but important classes). Shannon's mutual information requires a well-sampled database, which is not typical of many fields of modern science (such as biomedicine), where there is a limited number of samples to learn from, or at least not all classes of the target function (such as certain phenotypes in biomedicine) are well sampled. On the other hand, Kernel Canonical Correlation Analysis (KCCA) is a nonlinear correlation measure used effectively to detect independence, but its use for feature selection or ranking is limited because its formulation is not intended to measure the amount of information (entropy) in the dependence. In this paper, we propose a hybrid measure of relevance, Predictive Mutual Information (PMI), based on MI, which also accounts for the predictability of signals from each other, as in KCCA. We show that PMI has improved feature detection capability compared to MI, especially in catching suspicious coincidences that are rare but potentially important, not only for experimental studies but also for building computational models. We demonstrate the usefulness of PMI, and its superiority over MI, on both toy and real datasets.
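The exact PMI formulation is not given in this excerpt, so the following is only a minimal sketch of the general idea described above: combining a Shannon MI estimate between a candidate feature and the class label with a kernel-based predictability term standing in for the KCCA component. The functions mi_score, kernel_predictability, and pmi_like_score, the use of kernel ridge regression as the predictability proxy, and the multiplicative combination are all illustrative assumptions, not the authors' method.

```python
# Illustrative sketch only: the paper's PMI formula is not reproduced here.
# Assumption: a feature-relevance score = (histogram-based MI estimate) x
# (kernel predictability of one-hot class indicators from the feature),
# the latter serving as a simple stand-in for a KCCA-style correlation.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.kernel_ridge import KernelRidge


def mi_score(x, y, bins=10):
    # Discretize the continuous feature and estimate Shannon MI (in nats).
    x_binned = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    return mutual_info_score(y, x_binned)


def kernel_predictability(x, y, gamma=1.0, alpha=0.1):
    # Stand-in for KCCA: predict one-hot class indicators from the feature
    # with Gaussian-kernel ridge regression and return the mean correlation
    # between predictions and targets across classes.
    X = x.reshape(-1, 1)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)
    model = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha).fit(X, Y)
    Y_hat = model.predict(X)
    corrs = [np.corrcoef(Y[:, k], Y_hat[:, k])[0, 1] for k in range(len(classes))]
    return float(np.nanmean(corrs))


def pmi_like_score(x, y):
    # Hypothetical combination: scale MI by the predictability term so that
    # features that co-vary "on average" but do not predict (possibly rare)
    # classes are down-weighted.
    return mi_score(x, y) * max(kernel_predictability(x, y), 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=300)
    relevant = y + 0.3 * rng.normal(size=300)   # informative feature
    noise = rng.normal(size=300)                # irrelevant feature
    print("relevant:", pmi_like_score(relevant, y))
    print("noise:   ", pmi_like_score(noise, y))
```

Under these assumptions, the informative feature receives a markedly higher score than the noise feature; the sketch is meant only to convey how an MI term and a predictability term could be combined for feature ranking.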
