Empirically Estimable Classification Bounds Based on a New Divergence Measure

Information divergence functions play a critical role in statistics and information theory. In this paper we show that a non-parametric f-divergence measure can be used to provide improved bounds on the minimum binary classification probability of error, both when the training and test data are drawn from the same distribution and when there is a mismatch between the two. We confirm the theoretical results by designing feature-selection algorithms based on the criteria from these bounds and by evaluating them on a series of pathological speech classification tasks.
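As a rough illustration of how such a divergence can be estimated non-parametrically from data, the sketch below uses the Friedman-Rafsky multivariate runs statistic: build a Euclidean minimum spanning tree over the pooled samples and count the edges that join points from different classes. This is a minimal sketch, not the paper's exact implementation; the function name and the clipping at zero are illustrative choices, and the plug-in formula assumes the MST-based estimator family the abstract alludes to.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def dp_divergence(X, Y):
    """Estimate a divergence between two samples X (m x d) and Y (n x d)
    from the Friedman-Rafsky statistic: the number of minimum-spanning-tree
    edges over the pooled data that connect points from different samples."""
    m, n = len(X), len(Y)
    Z = np.vstack([X, Y])
    labels = np.concatenate([np.zeros(m), np.ones(n)])
    # MST of the pooled sample, built on the dense pairwise-distance graph.
    mst = minimum_spanning_tree(cdist(Z, Z)).tocoo()
    # R = number of MST edges joining points from different samples.
    R = np.sum(labels[mst.row] != labels[mst.col])
    # Plug-in estimate; clipped at zero because the finite-sample
    # statistic can dip slightly below it.
    return max(0.0, 1.0 - R * (m + n) / (2.0 * m * n))
```

When the two samples come from the same distribution, cross-sample MST edges are common and the estimate is near 0; when the samples are well separated, almost no MST edge crosses between them and the estimate approaches 1, which is the behavior a divergence-based error bound relies on.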
