Maximum Entropy Models and Stochastic Optimality Theory

In a series of recent publications (most notably Boersma (1998); see also Boersma and Hayes (2001)), Paul Boersma has developed a stochastic generalization of standard Optimality Theory in the sense of Prince and Smolensky (1993). While a classical OT grammar maps a set of candidates to its optimal element (or elements), in Boersma’s Stochastic Optimality Theory (StOT for short) a grammar defines a probability distribution over such a set. Boersma also developed a natural learning algorithm, the Gradual Learning Algorithm (GLA), which induces a StOT grammar from a corpus. StOT is able to cope with natural language phenomena like ambiguity, optionality, and gradient grammaticality, which are notoriously problematic for standard OT. Keller and Asudeh (2002) raise several criticisms against StOT in general and the GLA in particular. Partly in reaction to this, Goldwater and Johnson (2003) point out that maximum entropy (ME) models, which are widely used in computational linguistics, might be an alternative to StOT. ME models are similar enough to StOT that empirical results obtained in the former can be transferred to the latter, and they have arguably better formal properties than StOT. On the other hand, the GLA has higher cognitive plausibility than the standard learning algorithms for ME models (as can be seen from Boersma and Levelt (2000)). In this paper I will argue that it is possible to combine the advantages of StOT with the ME model. It can be shown that the GLA can be adapted to ME models almost without modification. Put differently, it turns out that the GLA is the single most natural on-line learning algorithm for ME models. Keller and Asudeh’s criticism, to the degree that it is justified, does not apply to the combination of ME evaluation with GLA learning, and the cognitive advantages of the GLA are maintained.
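For concreteness, the following sketch shows the log-linear form that ME grammars standardly take, together with a GLA-style on-line weight update; the notation (constraints $f_i$, weights $\lambda_i$, learning rate $\eta$, candidate set $\mathrm{Gen}(y)$) is generic and not drawn verbatim from the works cited above.

\[
P_\lambda(x \mid y) \;=\; \frac{\exp\bigl(\sum_i \lambda_i f_i(x,y)\bigr)}{\sum_{x' \in \mathrm{Gen}(y)} \exp\bigl(\sum_i \lambda_i f_i(x',y)\bigr)},
\qquad
\lambda_i \;\leftarrow\; \lambda_i + \eta\,\bigl(f_i(x_{\mathrm{obs}},y) - f_i(x_{\mathrm{samp}},y)\bigr),
\]

where $x_{\mathrm{obs}}$ is the observed output for input $y$ and $x_{\mathrm{samp}}$ is a candidate sampled from the current distribution $P_\lambda(\cdot \mid y)$. Since the sampled term is an unbiased estimate of the expected constraint value, the update is, in expectation, a stochastic gradient ascent step on the conditional log-likelihood; this is the sense in which a GLA-style learner can be read as fitting an ME model on-line.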