Data-Dependent Risk Bounds for Hierarchical Mixture of Experts Classifiers

The hierarchical mixture of experts architecture provides a flexible procedure for implementing classification algorithms. The classification is obtained by a recursive soft partition of the feature space in a data-driven fashion. Such a procedure enables local classification, where several experts are used, each assigned the task of classifying over some subspace of the feature space. In this work, we provide data-dependent generalization error bounds for this class of models, which lead to effective procedures for performing model selection. Tight bounds are particularly important here, because the model is highly parameterized. The theoretical results are complemented with numerical experiments based on a randomized algorithm, which mitigates the local minima that plague other approaches such as the expectation-maximization algorithm.
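To make the architecture concrete, the following is a minimal sketch of the forward pass of a two-level hierarchical mixture of experts for binary classification. It is not the paper's implementation: the two-level tree depth, softmax gating networks, logistic linear experts, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class HME:
    """Two-level hierarchical mixture of experts (illustrative sketch).

    A top-level gate softly partitions the feature space among m branches;
    each branch gate partitions again among k linear (logistic) experts.
    The predicted probability is the gate-weighted mixture of expert outputs.
    """

    def __init__(self, d, m=2, k=2, rng=None):
        rng = np.random.default_rng(rng)
        self.top_gate = rng.normal(scale=0.1, size=(d, m))          # top-level gating weights
        self.branch_gates = rng.normal(scale=0.1, size=(m, d, k))   # per-branch gating weights
        self.experts = rng.normal(scale=0.1, size=(m, k, d))        # logistic expert weights

    def predict_proba(self, X):
        # X: (n, d) feature matrix; returns P(y = 1 | x) for each row.
        g_top = softmax(X @ self.top_gate)                   # (n, m): soft top-level partition
        p = np.zeros(X.shape[0])
        for i in range(g_top.shape[1]):
            g_branch = softmax(X @ self.branch_gates[i])     # (n, k): soft partition within branch i
            expert_p = 1.0 / (1.0 + np.exp(-(X @ self.experts[i].T)))  # (n, k): expert outputs
            p += g_top[:, i] * (g_branch * expert_p).sum(axis=1)
        return p
```

In practice the gate and expert parameters would be fit by maximizing the likelihood of this mixture, classically via EM, or, as in the paper's experiments, by a randomized search procedure that is less prone to getting trapped in local minima.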
