Classification and feature gene selection using the normalized maximum likelihood model for discrete regression

This paper studies the problem of class discrimination based on the normalized maximum likelihood (NML) model for a nonlinear regression, where the nonlinearly transformed class labels, each taking M possible values, are assumed to be drawn from a multinomial trial process. The strength of MDL methods in statistical inference lies in finding the model structure, which in this particular classification problem amounts to finding the best set of feature genes. We first show that minimizing the codelength of the NML model over different sets of feature genes is a tractable problem. This description length of the class labels, obtained with various sets of feature genes in the nonlinear regression model, also allows intuitive comparisons of nested feature sets. We then extend the feature-gene selection model to a completely defined classifier and evaluate its classification error in a cross-validation experiment. The quantization process itself, which produces the required entries of the model, can also be evaluated with the NML description length. The new classification method is applied to leukemia class discrimination based on gene expression microarray data. We find classification errors as low as 0.03% with a quadruplet of binary quantized genes, which was top ranked by the NML description length.
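As a rough illustration of the codelength criterion described above, the sketch below computes an NML-style description length of binary class labels given a candidate set of binary quantized feature genes, grouping the samples by their feature pattern and encoding each group's labels as an independent Bernoulli string. This is a minimal sketch of the binary (M = 2) case only; the function and variable names are illustrative rather than taken from the paper, and the paper's multinomial formulation may differ in detail.

```python
import math
from collections import defaultdict

def bernoulli_nml_codelength(k, n):
    """NML codelength (in bits) of a binary string of length n containing k ones:
    negative log2 of the maximized Bernoulli likelihood plus log2 of the
    normalizing sum taken over all possible counts of ones."""
    if n == 0:
        return 0.0

    def max_loglik(ones, total):
        # log2 of the maximized likelihood (k/n)^k ((n-k)/n)^(n-k)
        if ones in (0, total):
            return 0.0
        p = ones / total
        return ones * math.log2(p) + (total - ones) * math.log2(1 - p)

    # Normalizer C_n = sum_j C(n, j) (j/n)^j ((n-j)/n)^(n-j), summed in log space
    log_terms = [math.log2(math.comb(n, j)) + max_loglik(j, n) for j in range(n + 1)]
    m = max(log_terms)
    log_C = m + math.log2(sum(2 ** (t - m) for t in log_terms))
    return -max_loglik(k, n) + log_C

def feature_set_codelength(features, labels):
    """Total codelength of the class labels given binary quantized feature genes:
    samples sharing the same feature pattern form one bin, and each bin's labels
    are encoded with the Bernoulli NML codelength."""
    bins = defaultdict(list)
    for x, y in zip(features, labels):
        bins[tuple(x)].append(y)
    return sum(bernoulli_nml_codelength(sum(ys), len(ys)) for ys in bins.values())

# Hypothetical usage: rank candidate gene sets by codelength (smaller is better).
if __name__ == "__main__":
    feats = [(0, 1), (0, 1), (1, 0), (1, 1), (1, 1), (0, 0)]
    labels = [0, 0, 1, 1, 1, 0]
    print(feature_set_codelength(feats, labels))
```

Under these assumptions, feature-gene selection amounts to evaluating this codelength for each candidate gene subset and quantization and keeping the subsets with the shortest description of the class labels.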
