Derivations of Normalized Mutual Information in Binary Classifications

Although conventional performance indexes, such as accuracy, are commonly used in classifier selection and evaluation, information-based criteria, such as mutual information, are becoming popular in feature and model selection. In this work, we analyze classifier learning under the criterion of maximizing normalized mutual information (NI), which is novel and well defined on a compact range for classifier evaluation. We derive closed-form relations between normalized mutual information and accuracy, precision, and recall in binary classifications. Exploring these relations reveals that NI is in fact a set of nonlinear functions, with a concordant power-exponent form, of each performance index. The relations can equivalently be expressed in terms of precision and recall, or of false-alarm rate and hit rate (recall).
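For intuition, every quantity involved can be computed from the four cells of a binary confusion matrix. Below is a minimal sketch, assuming the common normalization NI = I(T;Y) / H(T), where T is the true label and Y the prediction; the paper's exact normalization may differ, and the function name `normalized_mi` and the example counts are illustrative only.

```python
import numpy as np

def normalized_mi(tp, fn, fp, tn):
    """Normalized mutual information NI = I(T;Y) / H(T) computed from a
    binary confusion matrix. Normalizing by the target entropy H(T) is
    one common convention and is assumed here."""
    joint = np.array([[tp, fn], [fp, tn]], dtype=float)
    joint /= joint.sum()              # joint distribution p(t, y)
    pt = joint.sum(axis=1)            # marginal p(t) over true labels
    py = joint.sum(axis=0)            # marginal p(y) over predictions
    # I(T;Y) = sum_{t,y} p(t,y) log2( p(t,y) / (p(t) p(y)) ), with 0 log 0 = 0
    mi = sum(joint[i, j] * np.log2(joint[i, j] / (pt[i] * py[j]))
             for i in range(2) for j in range(2) if joint[i, j] > 0)
    ht = -sum(p * np.log2(p) for p in pt if p > 0)   # target entropy H(T)
    return mi / ht

# Illustrative counts: 90 hits, 10 misses, 5 false alarms, 95 correct rejections
print(normalized_mi(tp=90, fn=10, fp=5, tn=95))
```

The same four counts yield accuracy (tp + tn) / N, precision tp / (tp + fp), and recall tp / (tp + fn), which is why closed-form relations between NI and these indexes can be derived at all.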
