Rich Syntax from a Raw Corpus: Unsupervised Does It

We compare our model of unsupervised learning of linguistic structures, ADIOS [1], to some recent work in computational linguistics and in grammar theory. Our approach resembles Construction Grammar in its general philosophy (e.g., in its reliance on structural generalizations rather than on syntax projected by the lexicon, as in current generative theories), and Tree Adjoining Grammar in its computational characteristics (e.g., in its apparent affinity with Mildly Context Sensitive Languages). The representations learned by our algorithm are truly emergent from the (unannotated) corpus data, whereas those found in published works on cognitive and construction grammars and on TAGs are hand-tailored. Thus, our results complement and extend both the computational and the more linguistically oriented research into language acquisition. We conclude by suggesting how the empirical and formal study of language can best be integrated.

1 Unsupervised learning through redundancy reduction

Reduction of redundancy is a general (and arguably the only conceivable) approach to unsupervised learning [2, 3]. Written natural language (or transcribed speech) is trivially redundant to the extent that it relies on a fixed lexicon. This property of language makes possible the unsupervised recovery of words from a text corpus with all the spaces omitted, through a straightforward minimization of per-letter entropy [4]. Pushing entropy minimization to the limit would lead to an absurd situation in which the agglomeration of words into successively longer “primitive” sequences renders the resulting representation useless for dealing with novel texts (that is, incapable of generalization; cf. [5], p. 188).

We observe, however, that a word-based representation is still redundant to the extent that different sentences share the same word sequences. Such sequences need not be contiguous; indeed, the detection of paradigmatic variation within a slot in a set of otherwise identical aligned sequences (syntagms) is the basis for the classical distributional theory of language [6], as well as for some modern NLP methods [7]. The pattern (a syntagm together with the equivalence class of complementary-distribution symbols that may appear in its open slot) is the main representational building block of our system, ADIOS (for Automatic DIstillation Of Structure) [1].

Our goal here is to help bridge statistical and formal approaches to language [9] by placing our work on the unsupervised learning of structure in the context of current research in grammar acquisition in computational linguistics, and at the same time to link it to certain formal theories of grammar. Section 2 outlines the main computational principles behind the ADIOS model (for algorithmic details and empirical results, see [1, 10]). Sections 3 and 4 compare our model to selected approaches from computational and formal linguistics, respectively. Section 5 concludes with a discussion of the challenges ahead.

2 The principles behind the ADIOS algorithm

The representational power of ADIOS and its capacity for unsupervised learning rest on three principles: (1) probabilistic inference of pattern significance, (2) context-sensitive generalization, and (3) recursive construction of complex patterns. Each of these is described briefly below.
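As a concrete illustration of the per-letter entropy argument in Section 1, the following minimal Python sketch compares a letter-by-letter encoding of a spaceless toy text with an encoding in word-sized units. The toy corpus, the candidate lexicon, and the function names are illustrative assumptions, not part of the ADIOS system or of the method of [4].

```python
# A minimal sketch of the per-letter entropy criterion: coding a spaceless
# text in word-sized units lowers the code length per character relative to
# coding it letter by letter.  The corpus and lexicon below are toy examples.
from collections import Counter
from math import log2

def per_letter_entropy(tokens):
    """Average code length per character when each token costs -log2 of its
    empirical (unigram) probability."""
    counts = Counter(tokens)
    total = sum(counts.values())
    bits = sum(-c * log2(c / total) for c in counts.values())
    return bits / sum(len(t) for t in tokens)

def segment(text, lexicon):
    """Greedy longest-match segmentation against a candidate lexicon;
    characters not covered by the lexicon are emitted as single letters."""
    out, i = [], 0
    while i < len(text):
        for w in sorted(lexicon, key=len, reverse=True):
            if text.startswith(w, i):
                out.append(w)
                i += len(w)
                break
        else:
            out.append(text[i])
            i += 1
    return out

text = "thecatsatonthematthedogsatonthemat"
lexicon = ["the", "cat", "dog", "sat", "on", "mat"]

print(per_letter_entropy(list(text)))              # letters: ~3.1 bits per letter
print(per_letter_entropy(segment(text, lexicon)))  # words:   ~0.85 bits per letter
```

Under this naive unigram measure the per-letter figure keeps dropping as the units grow (merging the whole corpus into a single "word" drives it toward zero), which is exactly the degenerate limit cautioned against above: the representation compresses the training text but no longer generalizes to novel texts.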

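The pattern described in Section 1, a shared syntagm plus an equivalence class of items in complementary distribution within its open slot, can likewise be illustrated with a toy alignment procedure. This is only a hedged sketch: the extract_patterns function, the frame representation, and the miniature corpus are assumptions made for illustration, not the ADIOS algorithm itself, which additionally applies a probabilistic significance criterion and builds patterns recursively.

```python
# Toy illustration of paradigmatic variation within a slot: sentences that are
# identical except at one position contribute their differing words to an
# equivalence class attached to the shared frame (a syntagm with an open slot).
from collections import defaultdict

def extract_patterns(sentences):
    """Map each frame (a sentence with one position replaced by '_') to the
    set of words observed in that open slot; keep frames with >1 filler."""
    slots = defaultdict(set)
    for words in sentences:
        for i, w in enumerate(words):
            frame = tuple(words[:i]) + ("_",) + tuple(words[i + 1:])
            slots[frame].add(w)
    return {frame: fillers for frame, fillers in slots.items() if len(fillers) > 1}

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the mat".split(),
    "the bird sat on the mat".split(),
    "the cat slept on the mat".split(),
]

for frame, fillers in extract_patterns(corpus).items():
    print(" ".join(frame), "->", sorted(fillers))
# e.g.  the _ sat on the mat -> ['bird', 'cat', 'dog']
```

In ADIOS itself, as noted in Section 2, candidate patterns of this kind are accepted only if they pass a probabilistic significance test, and accepted patterns are then treated as new units, allowing complex patterns to be constructed recursively.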
References

[1] Peter Bühlmann et al. Variable Length Markov Chains: Methodology, Computing, and Software, 2004.
[2] Eytan Ruppin et al. Unsupervised Efficient Learning and Representation of Language Structure, 2003.
[3] C. Snow et al. Child language data exchange system. Journal of Child Language, 1984.
[4] Khalil Sima'an et al. A memory-based model of syntactic analysis: data-oriented parsing. J. Exp. Theor. Artif. Intell., 1999.
[5] R. Jackendoff. Foundations of Language: Brain, Meaning, Grammar, Evolution, 2002.
[6] Aravind K. Joshi et al. Tree-Adjoining Grammars. Handbook of Formal Languages, 1997.
[7] Zellig S. Harris. Distributional Structure, 1954.
[8] Menno van Zaanen. ABL: Alignment-Based Learning. COLING, 2000.
[9] Dan Klein et al. Natural Language Grammar Induction Using a Constituent-Context Model. NIPS, 2001.
[10] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. CoNLL, 2001.
[11] R. Langacker. Foundations of Cognitive Grammar, 1983.
[12] Rens Bod. Beyond Grammar: An Experience-Based Theory of Language, 1998.
[13] Andreas Stolcke et al. Inducing Probabilistic Grammars by Bayesian Model Merging. ICGI, 1994.
[14] Adele E. Goldberg. Constructions: a new theoretical approach to language. Trends in Cognitive Sciences, 2003.
[15] Horace Barlow. What is the computational goal of the neocortex?, 1994.
[16] A. Norman Redlich. Redundancy Reduction as a Strategy for Unsupervised Learning. Neural Computation, 1993.
[17] G. Miller. Cognitive science. Science, 1981.
[18] M. Gross. The Construction of Local Grammars, 1997.
[19] Paul M. Pietroski. The Character of Natural Language Semantics, 2002.
[20] Beatrice Santorini et al. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 1993.
[21] Fernando C. Pereira. Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, 2000.
[22] William Croft. Radical Construction Grammar: Syntactic Theory in Typological Perspective, 2001.
[23] J. Widdicombe. Sensory mechanisms. Pulmonary Pharmacology, 1996.
[24] Alexander Clark. Unsupervised Language Acquisition: Theory and Practice. arXiv, 2002.
[25] J. Wolff. Learning Syntax and Meanings Through Optimization and Distributional Analysis, 1988.
[26] C. Fillmore et al. Grammatical constructions and linguistic generalizations: The What's X doing Y? construction, 1999.
[27] R. Langacker. Foundations of Cognitive Grammar, Volume I: Theoretical Prerequisites, 1987.
[28] Eytan Ruppin et al. Automatic Acquisition and Efficient Representation of Syntactic Structures. NIPS, 2002.
[29] Z. Harris. Foundations of language, 1941.