Bayesian Multi-Hyperplane Machine for Pattern Recognition

The existing multi-hyperplane machine approach handles high-dimensional and complex datasets by approximating the input data region with a parametric mixture of hyperplanes. Consequently, it requires an excessively time-consuming search to find the optimal set of hyper-parameters. Another serious drawback is that the result is often suboptimal, since the optimal hyper-parameter values are likely to lie outside the search space because of the discretization step required in grid search. To address these challenges, we propose the BAyesian Multi-hyperplane Machine (BAMM). Our approach departs from a Bayesian perspective and constructs an alternative probabilistic view whose maximum-a-posteriori (MAP) estimation reduces exactly to the original optimization problem of a multi-hyperplane machine. This view allows us to place prior distributions over the hyper-parameters and to augment auxiliary variables, so that model parameters and hyper-parameters can be inferred efficiently via Markov chain Monte Carlo (MCMC) methods. We then employ a Stochastic Gradient Descent (SGD) framework to scale our model to ever-growing large datasets. Extensive experiments demonstrate that the proposed method can learn the optimal model without any parameter tuning and achieves accuracies comparable to state-of-the-art baselines, while seamlessly handling large-scale datasets.
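To make the MAP correspondence concrete, here is a minimal sketch in our own notation (the symbols W, w_{c,j}, \lambda, and the hinge-style loss \ell are illustrative assumptions, not taken from the paper). A multi-hyperplane machine with weight vectors w_{c,j} (hyperplane j of class c) predicts

\[ f(x) = \arg\max_{c} \max_{j} w_{c,j}^{\top} x, \]

and its regularized training objective

\[ \min_{W} \; \frac{\lambda}{2} \|W\|^{2} + \sum_{n=1}^{N} \ell(W; x_{n}, y_{n}) \]

is, up to an additive constant, the negative logarithm of the posterior

\[ p(W \mid \mathcal{D}) \;\propto\; \exp\!\Big(-\frac{\lambda}{2}\|W\|^{2}\Big) \prod_{n=1}^{N} \exp\big(-\ell(W; x_{n}, y_{n})\big), \]

so MAP estimation of W recovers the original optimization problem, and placing a prior over \lambda lets the hyper-parameter be inferred alongside W instead of being grid-searched.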

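The SGD component can likewise be illustrated with a short sketch of a Pegasos-style update for a generic multiclass multi-hyperplane hinge loss. This is an assumed formulation for illustration only: it omits BAMM's Bayesian treatment of hyper-parameters, and all names here (e.g., train_amm_sgd, num_planes) are ours, not the paper's.

import numpy as np

def train_amm_sgd(X, y, num_classes, num_planes=3, lam=1e-4, num_iters=100000, seed=0):
    # Pegasos-style SGD on a multiclass multi-hyperplane hinge loss (a sketch).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((num_classes, num_planes, d))   # one weight vector per (class, plane)
    for t in range(1, num_iters + 1):
        i = rng.integers(n)
        x, c_true = X[i], y[i]
        scores = W @ x                           # (num_classes, num_planes) scores
        j_true = int(np.argmax(scores[c_true]))  # best plane of the true class
        rival = scores.copy()
        rival[c_true] = -np.inf                  # exclude the true class
        c_riv, j_riv = np.unravel_index(np.argmax(rival), rival.shape)
        eta = 1.0 / (lam * t)                    # standard Pegasos step size
        W *= 1.0 - eta * lam                     # shrinkage from the L2 regularizer
        if 1.0 + scores[c_riv, j_riv] - scores[c_true, j_true] > 0.0:
            W[c_true, j_true] += eta * x         # pull the true class's best plane up
            W[c_riv, j_riv] -= eta * x           # push the strongest rival plane down
    return W

def predict_amm(W, X):
    # Class of the highest-scoring hyperplane: argmax_c max_j w_{c,j}^T x.
    return np.argmax(np.max(np.einsum('cjd,nd->ncj', W, X), axis=2), axis=1)

Each update touches a single example, so the per-step cost is independent of the dataset size, which is what makes an SGD framework attractive for ever-growing data.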