Introduction to the special section on linguistically apt statistical methods

In 1994—about six years after it was first infiltrated by statistical methods—the Association for Computational Linguistics hosted a workshop called “The Balancing Act: Combining Symbolic and Statistical Approaches to Language” (Klavans & Resnik, 1996). The workshop argued that linguistics and statistics were not fundamentally at odds, even though the recent well-known statistical techniques for part-of-speech disambiguation (Church, 1988; DeRose, 1988) had, like their predecessors in speech recognition, flouted Chomsky’s (1957) warnings that Markov or n-gram models were inadequate to model language. The success of these Markovian techniques had merely established that empirically estimated probabilities could be rather effective even with an impoverished theory of linguistic structure. As an engineering matter, the workshop argued, it was wise to incorporate probabilities or other numbers into any linguistic approach.

Several years later, it seems worth taking another snapshot from this perspective. It is fair to say that a greater proportion of hybrid approaches to language now are cleanly structured rather than cobbled together, and that the benefits to both sides of such approaches are better understood. The prevalent methodology is to design the form of one’s statistical model so that it is capable of expressing the kinds of linguistic generalizations that one cares about, and then to set the free parameters of this model so that its predicted behavior roughly matches the observed behavior of some training data.

The reason that one augments a symbolic generative grammar with probabilities is to make it more robust to noise and ambiguity. After all, statistics is the art of plausibly reconstructing the unknown, which is exactly what language comprehension and learning require. Conversely, one constrains a probability model with grammar to make it more robust to poverty of the stimulus. After all, from sparse data a statistician cannot hope to estimate a separate probability for every string of the language. All that is practical is to estimate a moderate set of parameters that encode high-level properties from which the behavior of the entire language emerges.

Carrying out this program is not trivial in practice. Patterning a statistical model after a linguistic theory may require some rethinking of the theory, especially if the model is to be elegant and computationally tractable. And there is more than one way to do it: the first few tries at adding linguistic sophistication often hurt a system’s accuracy rather than helping it. More complex linguistic representations also call for more complex, slower, and/or more approximate algorithms to estimate the parameters of the statistical model. Nonetheless, the paradigm has enabled progress in many areas of linguistics, speech processing, and natural language processing.

The present special section of brief reports spans diverse interests:

Johnson and Riezler show how a model of the relative probabilities of parse trees can be made sensitive to any linguistic feature one might care to specify. They report that this approach can be applied to the tricky case of Lexical-Functional Grammar (LFG).

Eisner explains how to attach probabilities to lexicalized grammars, including the lexical redundancy rules that express transformational generalizations in the grammar. The model is designed so that learners are naturally inclined to discover and use such generalizations.
Light and Greiff review several published techniques for discovering lexical selectional preferences. In these techniques, the models are constrained not just by the abstract theory of a taxonomy of meaning, but by the particular taxonomy of the WordNet lexical database.

Nock and Young report on speech modeling techniques inspired by the fact that speech is produced not by a monolithic mouth, but by a system of articulators (tongue root, lips, etc.) that act somewhat independently of one another.