The unsupervised learning of natural language structure

There is precisely one complete language processing system to date: the human brain. Though the degree of built-in bias in human learners is debated, it is clear that we acquire language primarily in an unsupervised fashion. Computational approaches to language processing, by contrast, are almost exclusively supervised, relying on hand-labeled corpora for training. This reliance is largely due to the repeatedly discouraging performance of unsupervised approaches. In particular, the problem of learning syntax (grammar) from completely unannotated text has received a great deal of attention for well over a decade, with little in the way of positive results. We argue that previous methods for this task have generally underperformed because of the representations they used. Overly complex models are easily distracted by non-syntactic correlations (such as topical associations), while overly simple models are not rich enough to capture important first-order properties of language (such as directionality, adjacency, and valence).

In this work, we describe several syntactic representations and associated probabilistic models which are designed to capture the basic character of natural language syntax as directly as possible. First, we examine a nested, distributional method which induces bracketed tree structures. Second, we examine a dependency model which induces word-to-word dependency structures. Finally, we demonstrate that these two models perform better in combination than either does alone. With these representations, high-quality analyses can be learned from surprisingly little text, with no labeled examples, in several languages (we show experiments with English, German, and Chinese). Our results show above-baseline performance in unsupervised parsing in each of these languages.

Grammar induction methods are useful because parsed corpora exist for only a small number of languages. More generally, most high-level NLP tasks, such as machine translation and question answering, lack richly annotated corpora, making unsupervised methods extremely appealing even for common languages like English. Finally, while the models in this work are not intended to be cognitively plausible, their effectiveness can inform the investigation of which biases are or are not needed in the human acquisition of language.
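To make the first of these representations concrete: in a constituent-context style of model, each candidate span of a part-of-speech sequence is scored by how plausible its yield (the tags it covers) and its context (the tags immediately bordering it) are for a constituent. The sketch below shows only that span score, not the full model or its EM training; the probability tables `P_yield` and `P_context` and their values are hypothetical placeholders, not parameters learned in this work.

```python
# Hypothetical, illustrative parameters -- not values learned in this work.
# A span is scored as P(yield | constituent) * P(context | constituent),
# where the context is the pair of tags bordering the span
# ("<s>" and "</s>" mark the sentence boundaries).
P_yield = {("DT", "NN"): 0.20}
P_context = {("<s>", "VBZ"): 0.30}

UNSEEN = 1e-6  # small floor probability for unseen yields/contexts

def span_score(tags, i, j):
    """Score tags[i:j] as a constituent: P(yield) * P(context)."""
    span_yield = tuple(tags[i:j])
    left = tags[i - 1] if i > 0 else "<s>"
    right = tags[j] if j < len(tags) else "</s>"
    return (P_yield.get(span_yield, UNSEEN)
            * P_context.get((left, right), UNSEEN))

tags = ["DT", "NN", "VBZ"]
# The span "DT NN" with context (<s>, VBZ) scores 0.20 * 0.30 = 0.06.
score = span_score(tags, 0, 2)
```

A full model of this kind multiplies such scores over every span of a bracketing and re-estimates the two distributions with EM; the point here is only that the representation scores spans directly by their distributional signature rather than through grammar rules.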
