Discriminative Learning via Semidefinite Probabilistic Models

Discriminative linear models are a popular tool in machine learning. They can broadly be divided into two types. The first is linear classifiers, such as support vector machines, which are well studied and provide state-of-the-art results; one shortcoming of these models is that their output (the 'margin') is not calibrated and cannot be translated naturally into a distribution over the labels, which makes it difficult to incorporate them as components of larger systems, unlike probabilistic approaches. The second type constructs class-conditional distributions through a nonlinearity (e.g. log-linear models), but is occasionally worse in terms of classification error. We propose a supervised learning method that combines the best of both approaches. Specifically, our method provides a distribution over the labels that is a linear function of the model parameters. As a consequence, differences between probabilities are linear functions, a property that most probabilistic models (e.g. log-linear) do not have. Our model assumes that classes correspond to linear subspaces (rather than to half-spaces). Using a relaxed projection operator, we construct a measure of the degree to which a given vector 'belongs' to a subspace, which yields a distribution over labels. Interestingly, this view is closely related to concepts in quantum detection theory. The resulting models can be trained either to maximize the margin or to optimize average likelihood measures. The corresponding optimization problems are semidefinite programs, which can be solved efficiently. We illustrate the performance of our algorithm on real-world datasets and show that it outperforms second-order kernel methods.
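The abstract does not spell out the parameterization, but the properties it lists (a label distribution that is linear in the parameters, classes as linear subspaces, relaxed projection operators, the connection to quantum detection) are consistent with representing each class y by a positive semidefinite matrix A_y, with the matrices A_y summing to the identity, and setting p(y|x) = x^T A_y x for unit-norm x. The following sketch is an illustration under that assumption, not the paper's exact construction; all names are hypothetical.

    import numpy as np

    def label_distribution(x, A_list):
        # Assumes each A in A_list is PSD and sum(A_list) equals the identity,
        # so that p(y | x) = x^T A_y x is non-negative and sums to one over y
        # when ||x|| = 1.  The distribution is linear in the parameters A_y.
        x = np.asarray(x, dtype=float)
        x = x / np.linalg.norm(x)
        return np.array([x @ A @ x for A in A_list])

    # Toy example in R^2 with two classes: A0 is a "relaxed projection"
    # (eigenvalues in [0, 1]) and A1 = I - A0, so probabilities sum to one.
    A0 = np.array([[0.9, 0.0],
                   [0.0, 0.2]])
    A1 = np.eye(2) - A0
    print(label_distribution([2.0, 1.0], [A0, A1]))

Because p(y|x) is linear in the matrices A_y, margin constraints of the form p(y_i|x_i) - p(y|x_i) >= gamma are linear matrix inequalities, so a max-margin training problem becomes a semidefinite program, as the abstract states. Below is one plausible (again hypothetical) formulation using the cvxpy modelling library; the paper's actual margin and likelihood objectives may differ.

    import cvxpy as cp
    import numpy as np

    def fit_max_margin(X, y, n_classes):
        # Sketch of a max-margin SDP: maximize the smallest gap
        # p(y_i | x_i) - p(y | x_i) over all samples i and wrong labels y,
        # subject to the A_y being PSD and summing to the identity.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        d = X.shape[1]
        A = [cp.Variable((d, d), PSD=True) for _ in range(n_classes)]
        gamma = cp.Variable()
        constraints = [sum(A) == np.eye(d)]
        for x_i, y_i in zip(X, y):
            for c in range(n_classes):
                if c != y_i:
                    constraints.append(
                        cp.quad_form(x_i, A[y_i]) - cp.quad_form(x_i, A[c]) >= gamma)
        cp.Problem(cp.Maximize(gamma), constraints).solve()
        return [A_c.value for A_c in A], gamma.value

The feasible set always contains A_y = I / n_classes (margin zero), so the program is feasible even when the data are not separable; the optimal gamma is then the best achievable worst-case probability gap.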
