On Strong Consistency of Model Selection in Classification

This paper considers model selection in classification. In many applications, such as pattern recognition, probabilistic inference using a Bayesian network, and prediction of the next element in a sequence based on a Markov chain, the conditional probability $P(Y=y \mid X=x)$ of class $y \in Y$ given attribute value $x \in X$ is utilized. By a model we mean the equivalence relation on $X$: for $x, x' \in X$, $x \sim x' \iff P(Y=y \mid X=x) = P(Y=y \mid X=x')$ for all $y \in Y$. By classification we mean that the number of such equivalence classes is finite. We estimate the model from $n$ samples $z^n = (x_i, y_i)_{i=1}^{n} \in (X \times Y)^n$, using information criteria of the form empirical entropy $H$ plus penalty term $(k/2) d_n$ (the estimated model is the one that minimizes $H + (k/2) d_n$), where $k$ is the number of independent parameters in the model and $\{d_n\}_{n=1}^{\infty}$ is a nonnegative real sequence such that $\limsup_n d_n / n = 0$. For autoregressive processes, although the definitions of $H$ and $k$ are different, it is known that the estimated model almost surely coincides with the true model as $n \to \infty$ if $\{d_n\}_{n=1}^{\infty} > \{2 \log\log n\}_{n=1}^{\infty}$, and that it does not if $\{d_n\}_{n=1}^{\infty} < \{2 \log\log n\}_{n=1}^{\infty}$ (Hannan and Quinn). Whether the same property holds for classification was an open problem. This paper solves the problem in the affirmative.
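The criterion described above can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: $H$ is taken to be the total (unnormalized) empirical conditional entropy in nats, $k$ is taken to be the number of equivalence classes times $(|Y| - 1)$, and the search is a plain minimum over a caller-supplied list of candidate partitions of $X$. The function names `empirical_entropy`, `criterion`, and `select_model` are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(samples, partition):
    """Total empirical conditional entropy (nats) of y given the block of x.

    Assumption: H = -sum over (block, y) of count * log(count / block_count).
    """
    block_of = {x: b for b, block in enumerate(partition) for x in block}
    joint = Counter((block_of[x], y) for x, y in samples)
    marginal = Counter(block_of[x] for x, _ in samples)
    return -sum(c * math.log(c / marginal[b]) for (b, _), c in joint.items())


def criterion(samples, partition, num_classes, d_n):
    """H + (k/2) * d_n, with k assumed to be (#blocks) * (num_classes - 1)."""
    k = len(partition) * (num_classes - 1)
    return empirical_entropy(samples, partition) + 0.5 * k * d_n


def select_model(samples, candidates, num_classes, d_n):
    """Return the candidate partition that minimizes the criterion."""
    return min(candidates, key=lambda p: criterion(samples, p, num_classes, d_n))


# Toy data: x = 0 and x = 1 share one conditional distribution, x = 2 differs.
samples = ([(0, 0)] * 50 + [(0, 1)] * 50 +
           [(1, 0)] * 50 + [(1, 1)] * 50 +
           [(2, 1)] * 100)
candidates = [
    [[0, 1, 2]],
    [[0, 1], [2]],
    [[0], [1], [2]],
    [[0, 2], [1]],
    [[0], [1, 2]],
]
n = len(samples)
d_n = math.log(n)  # one illustrative penalty sequence with d_n / n -> 0
print(select_model(samples, candidates, num_classes=2, d_n=d_n))
```

Here `d_n = log n` is only one illustrative choice of penalty sequence; substituting a multiple of `math.log(math.log(n))` gives a penalty of the Hannan and Quinn type discussed above.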

[1] I. Csiszár, et al. The consistency of the BIC Markov order estimator, 2000.

[2] B. G. Quinn, et al. The determination of the order of an autoregression, 1979.

[3] H. Cramér. Mathematical methods of statistics, 1947.

[4] Neri Merhav, et al. The estimation of the model order in exponential families, 1989, IEEE Trans. Inf. Theory.

[5] W. Stout. Almost sure convergence, 1974.

[6] Anthony C. Atkinson, et al. A Method for Discriminating between Models, 1970.

[7] Neri Merhav, et al. On the estimation of the order of a Markov chain and universal data compression, 1989, IEEE Trans. Inf. Theory.

[8] J. Rissanen, et al. Modeling By Shortest Data Description, 1978, Autom.

[9] H. Akaike. A new look at the statistical model identification, 1974.

[10] I. Csiszár, et al. The consistency of the BIC Markov order estimator, 2000, 2000 IEEE International Symposium on Information Theory (Cat. No.00CH37060).

[11] T. Speed, et al. Data compression and histograms, 1992.

[12] G. Schwarz. Estimating the Dimension of a Model, 1978.

[13] John C. Kieffer, et al. Strongly consistent code-based identification and order estimation for constrained finite-state model classes, 1993, IEEE Trans. Inf. Theory.

[14] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[15] Joe Suzuki, et al. A Construction of Bayesian Networks from Databases Based on an MDL Principle, 1993, UAI.

[16] Prakash Narayan, et al. Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures, 1994, IEEE Trans. Inf. Theory.

[17] E. Hannan. The Estimation of the Order of an ARMA Process, 1980.

[18] Patrick Billingsley, et al. Statistical inference for Markov processes, 1961.

[19] Gregory F. Cooper, et al. A Bayesian Method for the Induction of Probabilistic Networks from Data, 1992.

[20] E. Hannan, et al. The determination of optimum structures for the state space representation of multivariate stochastic processes, 1982.

[21] R. Shibata. Selection of the order of an autoregressive model by Akaike's information criterion, 1976.

[22] Neri Merhav, et al. Estimating the number of states of a finite-state source, 1992, IEEE Trans. Inf. Theory.

[23] L. Finesso. Consistent estimation of the order for Markov and hidden Markov chains, 1992.

[24] J. Rissanen. Stochastic Complexity and Modeling, 1986.

[25] E. Hannan, et al. On stochastic complexity and nonparametric density estimation, 1988.