Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications

Fano's inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, however, we often care about more than the error rate: in medical diagnosis, for instance, different errors incur different costs, so the overall risk is cost-sensitive. Two other popular criteria are the balanced error rate (BER) and the F-score. In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon's as a special case) to derive upper and lower bounds on the optimal F-score, BER, and cost-sensitive risk, extending Fano's result. As a consequence, we show that Infomax is not suitable for optimizing the F-score or the cost-sensitive risk, in the sense that it can lead to a low F-score and a high risk. For cost-sensitive risk, we propose a new conditional entropy formulation that avoids this inconsistency. In addition, we consider the common practice of tuning a classifier's performance by thresholding the posterior probability. As is widely known, a threshold of 0.5, where the posteriors cross, minimizes the error rate; we derive analogous optimal thresholds for the F-score and BER.
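
For orientation, the two-class form of Fano's inequality that these bounds extend can be stated as follows (this is the standard textbook statement, not a result of this paper): with Bayes error P_e and binary entropy H_b,

    H(Y \mid X) \le H_b(P_e), \qquad H_b(p) = -p \log p - (1 - p)\log(1 - p),

so that P_e \ge H_b^{-1}\big(H(Y \mid X)\big), where the inverse is taken on [0, 1/2]. The paper's contribution is to derive analogous relations between conditional entropy and the optimal F-score, BER, and cost-sensitive risk.

The thresholding practice mentioned at the end of the abstract can be made concrete with a small numerical sketch. The Python snippet below is purely illustrative (the labels, posteriors, and threshold grid are synthetic and not taken from the paper): it sweeps a threshold on the posterior P(Y=1 | x) and reports the threshold that empirically minimizes the error rate, minimizes the BER, and maximizes F1.

    import numpy as np

    def metrics_at_threshold(y_true, posterior, t):
        """Error rate, balanced error rate (BER), and F1 for the rule
        'predict class 1 iff P(Y=1 | x) >= t'."""
        y_pred = (posterior >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        err = (fp + fn) / len(y_true)                                # plain error rate
        ber = 0.5 * (fp / max(fp + tn, 1) + fn / max(fn + tp, 1))    # mean of class-wise error rates
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)                       # F1 = 2TP / (2TP + FP + FN)
        return err, ber, f1

    # Synthetic labels and noisy posteriors, for illustration only.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)
    p = np.clip(0.5 * y + rng.normal(0.25, 0.2, size=1000), 0.0, 1.0)

    grid = np.linspace(0.01, 0.99, 99)
    scores = np.array([metrics_at_threshold(y, p, t) for t in grid])
    print("threshold minimizing error rate:", grid[scores[:, 0].argmin()])
    print("threshold minimizing BER:      ", grid[scores[:, 1].argmin()])
    print("threshold maximizing F1:       ", grid[scores[:, 2].argmax()])

On well-calibrated posteriors the error-rate-optimal threshold sits near 0.5, as the abstract notes; for the F-score and BER the optimal threshold generally moves away from 0.5, which is what the paper's derived thresholds characterize.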
