An Optimization Approach of Deriving Bounds between Entropy and Error from Joint Distribution: Case Study for Binary Classifications

In this work, we propose a new approach to deriving bounds between entropy and error from a joint distribution by means of optimization. A specific case study is given for binary classification. Two basic types of classification error are investigated, namely Bayesian and non-Bayesian errors; non-Bayesian errors are considered because most classifiers produce non-Bayesian solutions. For both types of error, we derive closed-form relations between each bound and the error components. When Fano’s lower bound in a diagram of “Error Probability vs. Conditional Entropy” is realized with this approach, its interpretation is broadened to include non-Bayesian errors and the situations arising from independence properties of the variables. A new upper bound on the Bayesian error is derived with respect to the minimum prior probability; it is generally tighter than Kovalevskij’s upper bound.
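
For the binary case, the kind of entropy–error relationship the abstract refers to can be illustrated numerically. The sketch below is a minimal illustration, not the paper's derivation: it assumes the two-class form of Fano's inequality, H(Y|X) ≤ H_b(P_e), and a Kovalevskij-type upper bound of the form P_e ≤ H(Y|X)/2 with entropy in bits; the example joint distribution and all function names are hypothetical.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def bayes_error(joint):
    """Bayes error for a discrete joint P(x, y): sum over x of the smaller class mass."""
    return float(np.sum(np.min(joint, axis=1)))

def conditional_entropy(joint):
    """H(Y|X) in bits for a discrete joint P(x, y) with rows indexed by x."""
    px = joint.sum(axis=1, keepdims=True)          # marginal P(x)
    cond = np.clip(joint / px, 1e-12, 1.0)         # P(y|x), clipped to avoid log(0)
    h_per_x = -np.sum(cond * np.log2(cond), axis=1)
    return float(np.sum(px.ravel() * h_per_x))

def fano_lower_bound(h_cond):
    """Smallest Pe in [0, 1/2] with H_b(Pe) >= H(Y|X), found on a fine grid."""
    grid = np.linspace(0.0, 0.5, 100001)
    feasible = grid[binary_entropy(grid) >= h_cond - 1e-12]
    return float(feasible[0]) if feasible.size else 0.5

# Hypothetical joint distribution P(x, y): rows = feature value x, columns = class y.
joint = np.array([[0.35, 0.05],
                  [0.10, 0.50]])

pe = bayes_error(joint)
h = conditional_entropy(joint)
print(f"Bayes error Pe                         = {pe:.4f}")
print(f"Conditional entropy H(Y|X)             = {h:.4f} bits")
print(f"Fano lower bound on Pe                 = {fano_lower_bound(h):.4f}")
print(f"Kovalevskij-type upper bound H(Y|X)/2  = {h / 2:.4f}")
```

For this example the Bayes error (0.15) falls between the Fano lower bound (about 0.146) and the Kovalevskij-type upper bound (about 0.30), which is the ordering the bounds are meant to guarantee.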
