An experimental and theoretical comparison of model selection methods

We investigate the problem of {\it model selection} in the setting of supervised learning of boolean functions from independent random examples. More precisely, we compare methods for balancing the complexity of the hypothesis chosen against its observed error on a random training sample of limited size, when the goal is to minimize the resulting generalization error. We undertake a detailed comparison of three well-known model selection methods: a variation of Vapnik's {\it Guaranteed Risk Minimization} (GRM), an instance of Rissanen's {\it Minimum Description Length Principle} (MDL), and (hold-out) cross validation (CV). We introduce a general class of model selection methods (called {\it penalty-based} methods) that includes both GRM and MDL, and provide general methods for analyzing such rules. We provide both controlled experimental evidence and formal theorems to support the following conclusions:

• Even on simple model selection problems, the behavior of the methods examined can be both complex and incomparable. Furthermore, no amount of "tuning" of the rules investigated (such as introducing constant multipliers on the complexity penalty terms, or a distribution-specific "effective dimension") can eliminate this incomparability.

• It is possible to give rather general bounds on the generalization error, as a function of sample size, for penalty-based methods. The quality of such bounds depends in a precise way on the extent to which the method considered automatically limits the complexity of the hypothesis selected.

• For {\it any} model selection problem, the additional error of cross validation compared to {\it any} other method can be bounded above by the sum of two terms. The first term is large only if the learning curve of the underlying function classes experiences a "phase transition" between (1-\gamma)m and m examples (where \gamma is the fraction of the sample reserved for testing in CV). The second, competing term can be made arbitrarily small by increasing \gamma.

• The class of penalty-based methods is fundamentally handicapped, in the sense that there exist two types of model selection problems for which every penalty-based method must incur large generalization error on at least one, while CV enjoys small generalization error on both.
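As a concrete illustration of the two families of rules compared above, here is a minimal, self-contained sketch (in Python, which the paper does not use). It contrasts a generic penalty-based rule, which minimizes training error plus a complexity penalty, with hold-out cross validation, which reserves a fraction \gamma of the sample for testing. The toy lookup-table hypothesis class, the names fit_hypothesis and penalty, and the particular penalty function are illustrative assumptions; they do not reproduce the paper's GRM or MDL penalties or its experimental setup.

```python
# Sketch of penalty-based model selection vs. hold-out cross validation on a
# toy boolean learning problem. All class and penalty choices are assumptions
# made for illustration, not the paper's setup.

import math
import random

random.seed(0)

N_BITS = 10   # number of boolean attributes per example
NOISE = 0.2   # label noise rate
M = 500       # training sample size


def target(x):
    # Unknown target: majority vote over the first 3 bits.
    return int(sum(x[:3]) >= 2)


def draw_sample(m):
    sample = []
    for _ in range(m):
        x = tuple(random.randint(0, 1) for _ in range(N_BITS))
        y = target(x)
        if random.random() < NOISE:  # flip the label with probability NOISE
            y = 1 - y
        sample.append((x, y))
    return sample


def fit_hypothesis(sample, d):
    # Empirical-error minimizer in the class F_d: a lookup table over the
    # first d bits, predicting the majority training label for each prefix.
    counts = {}
    for x, y in sample:
        c = counts.setdefault(x[:d], [0, 0])
        c[y] += 1
    table = {key: int(c[1] >= c[0]) for key, c in counts.items()}
    return lambda x: table.get(x[:d], 0)


def error(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)


def penalty_based_selection(sample, max_d, penalty):
    # Generic penalty-based rule: pick the complexity d minimizing
    # training error + penalty(d, m).
    m = len(sample)
    best = min(range(max_d + 1),
               key=lambda d: error(fit_hypothesis(sample, d), sample) + penalty(d, m))
    return best, fit_hypothesis(sample, best)


def holdout_cv_selection(sample, max_d, gamma=0.3):
    # Hold-out CV: train on (1 - gamma) * m examples, pick the complexity d
    # with the smallest error on the remaining gamma * m examples, then refit
    # on the full sample (one common variant).
    split = int((1 - gamma) * len(sample))
    train, test = sample[:split], sample[split:]
    best = min(range(max_d + 1),
               key=lambda d: error(fit_hypothesis(train, d), test))
    return best, fit_hypothesis(sample, best)


if __name__ == "__main__":
    sample = draw_sample(M)
    fresh = draw_sample(20000)  # large fresh sample as a proxy for generalization error

    # A crude structural-risk-style penalty keyed to the table size 2^d;
    # the paper's GRM and MDL rules use specific penalty functions that this
    # sketch does not reproduce.
    penalty = lambda d, m: math.sqrt((2 ** d) / m)

    for name, (d, h) in [("penalty-based", penalty_based_selection(sample, 8, penalty)),
                         ("hold-out CV", holdout_cv_selection(sample, 8))]:
        print(f"{name:14s} chose d={d}, generalization error ~ {error(h, fresh):.3f}")
```

Running the sketch, both rules typically recover a complexity near the true one (d = 3) and a generalization error near the noise rate; changing the penalty function or the hold-out fraction gamma shifts which complexities each rule favors, which is the kind of behavior the paper studies in detail.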
