A Simple Introduction to Maximum Entropy Models for Natural Language Processing

Many problems in natural language processing can be viewed as linguistic classification problems, in which linguistic contexts are used to predict linguistic classes. Maximum entropy models offer a clean way to combine diverse pieces of contextual evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic context. This report demonstrates the use of a particular maximum entropy model on an example problem, and then proves some relevant mathematical facts about the model in a simple and accessible manner. This report also describes an existing procedure called Generalized Iterative Scaling, which estimates the parameters of this particular model. The goal of this report is to provide enough detail to re-implement the maximum entropy models described in [Ratnaparkhi], [Reynar and Ratnaparkhi], and [Ratnaparkhi], and also to provide a simple explanation of the maximum entropy formalism.

Introduction

Many problems in natural language processing (NLP) can be re-formulated as statistical classification problems, in which the task is to estimate the probability of class a occurring with context b, or p(a, b). Contexts in NLP tasks usually include words, and the exact context depends on the nature of the task: for some tasks, the context b may consist of just a single word, while for others, b may consist of several words and their associated syntactic labels. Large text corpora usually contain some information about the cooccurrence of a's and b's, but never enough to completely specify p(a, b) for all possible (a, b) pairs, since the words in b are typically sparse. The problem is then to find a method for using the sparse evidence about the a's and b's to reliably estimate a probability model p(a, b).

Consider the Principle of Maximum Entropy [Jaynes], [Good], which states that the correct distribution p(a, b) is that which maximizes entropy, or "uncertainty", subject to the constraints which represent "evidence", i.e., the facts known to the experimenter. [Jaynes] discusses its advantages:

    ... in making inferences on the basis of partial information we must use that probability distribution which has maximum entropy, subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to arbitrary assumption of information which by hypothesis we do not have.

More explicitly, if A denotes the set of possible classes, and B denotes the set of possible contexts, p should maximize the entropy

    H(p) = -\sum_{a \in A,\, b \in B} p(a, b) \log p(a, b)
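As a purely illustrative aside (not drawn from the report itself), the short Python sketch below computes H(p) for a toy joint distribution over a hypothetical space of two classes and two contexts, and shows that the uniform distribution, the least committed assignment when nothing is known beyond the fact that probabilities sum to one, has a strictly larger entropy than a more "opinionated" distribution. The class and context names are invented for the example only.

import numpy as np

# Hypothetical example: two classes A = {x, y} and two contexts B = {u, v},
# so the joint distribution p(a, b) has four entries.
def entropy(p):
    # H(p) = -sum over (a, b) of p(a, b) * log p(a, b),
    # skipping zero-probability entries (their contribution is 0 in the limit).
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

uniform = np.full(4, 0.25)               # maximum-entropy choice with no evidence
skewed = np.array([0.7, 0.1, 0.1, 0.1])  # a more "committed" distribution

print(entropy(uniform))   # log 4, about 1.386 nats: the largest achievable value
print(entropy(skewed))    # about 0.94 nats: strictly smaller

In this sense, picking any distribution other than the maximum-entropy one amounts to assuming information that the evidence does not actually supply.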