Multiplicative Updates for Classification by Mixture Models

We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM.
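The abstract does not reproduce the paper's own discriminative multiplicative updates, but it does state that nonnegative matrix factorization is used as a preprocessing step to obtain sparse features. As a minimal, hedged sketch of that step only, the snippet below implements Lee and Seung's standard multiplicative updates for NMF under the Frobenius objective (not the paper's classifier updates); the function name, rank, and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Lee & Seung multiplicative updates for NMF with the Frobenius objective.

    V    : nonnegative data matrix, shape (n_features, n_samples)
    rank : number of basis vectors (sparse features) to learn
    Returns W (n_features, rank) and H (rank, n_samples) with V ~ W @ H.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Elementwise multiplicative updates: they never change the sign of an
        # entry, so W and H stay nonnegative as long as V is nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical usage: factor nonnegative inputs (e.g. pixel intensities) and
# feed the columns of H to the classifier as a sparse distributed representation.
V = np.abs(np.random.default_rng(1).standard_normal((784, 100)))
W, H = nmf_multiplicative(V, rank=25)
```

The multiplicative form mirrors the flavor of updates discussed in the abstract: each iteration rescales the current parameters by a ratio of nonnegative terms, so no learning rate needs to be tuned and nonnegativity is preserved automatically.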
