How can we apply neural network and machine learning methodologies to natural language processing? In this paper we consider the task of training a neural network to classify natural language sentences as grammatical or ungrammatical, thereby exhibiting the same kind of discriminatory power provided by the Principles-and-Parameters linguistic framework, or Government-and-Binding theory. We have investigated the following models: feed-forward neural networks, Frasconi-Gori-Soda and Back-Tsoi locally recurrent neural networks, Williams-and-Zipser and Elman recurrent neural networks, Euclidean and edit-distance nearest-neighbors, simulated annealing, and decision trees. The non-neural-network machine learning methods are included primarily for comparison. Initial simulations were only partially successful, and only when a large temporal window was used as input to the models. Investigation indicated that success obtained in this way did not imply that the models had learnt the grammar to a significant degree. Attempts to train networks with small temporal windows failed until we implemented several techniques aimed at avoiding local minima. We discuss the strengths and weaknesses of learning as compared to manual encoding, and we consider the similarities and differences between the various neural network and machine learning approaches.

* Also with Electrical and Computer Engineering, University of Queensland, St. Lucia, Qld 4072, Australia.
† Also with the Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742.

1 Motivation

1.1 Language and Its Acquisition

Certainly one of the most important questions for the study of human language is: How do people unfailingly manage to acquire such a complex rule system, a system so complex that it has to date resisted the efforts of linguists to adequately describe it in a formal system (Chomsky 1986)? Here we provide a couple of examples of the kind of knowledge native speakers take for granted. For instance, any native speaker of English knows that the adjective eager obligatorily takes the complementizer for with a sentential complement that contains an overt subject, but that the verb believe cannot. Moreover, eager may take a sentential complement with a non-overt, i.e. an implied or understood, subject, but believe cannot (as is conventional, we use the asterisk to indicate ungrammaticality in these examples):

*I am eager John to be here
I believe John to be here
I am eager for John to be here
*I believe for John to be here
I am eager to be here
*I believe to be here

Such grammaticality judgments are sometimes subtle but unarguably form part of the native speaker's language competence. In other cases, the judgment concerns not acceptability but other aspects of language competence, such as interpretation. Consider the reference of the embedded subject of the predicate to talk to in the following examples:

John is too stubborn for Mary to talk to
John is too stubborn to talk to
John is too stubborn to talk to Bill

In the first sentence, it is clear that Mary is the subject of the embedded predicate. As every native speaker knows, there is a strong contrast in the co-reference options for the understood subject in the second and third sentences despite their surface similarity. In the third sentence, John must be the implied subject of the predicate to talk to. By contrast, John is understood as the object of the predicate in the second sentence, the subject here having arbitrary reference; in other words, the sentence can be read as "John is too stubborn for some arbitrary person to talk to John."

The point we would like to emphasize here is that the language faculty has impressive discriminatory power, in the sense that a single word, as seen in the examples above, can result in sharp differences in acceptability or alter the interpretation of a sentence considerably. Furthermore, the judgments shown above are robust in the sense that virtually all native speakers will agree with the data. In the light of such examples, and the fact that such contrasts crop up not just in English but in other languages as well (for example, the stubborn contrast also holds in Dutch), some linguists (chiefly Chomsky (Chomsky 1981)) have hypothesized that such knowledge is only partially acquired: the lack of variation found across speakers, and indeed across languages, for certain classes of data suggests that there exists a fixed component of the language system. In other words, there is an innate component of the language faculty of the human mind that governs language processing. All languages obey these so-called universal principles. Since languages do differ with regard to properties such as subject-object-verb order, the principles are subject to parameters that encode the systematic variation found in particular languages. Under the innateness hypothesis, only the language parameters plus the language-specific lexicon are acquired by the speaker; in particular, the principles are not learned. Based on these assumptions, the study of these language-independent principles has become known as the Principles-and-Parameters framework, or Government-and-Binding (GB) theory.

In this paper, we ask the question: Can a neural network be made to exhibit the same kind of discriminatory power on the data GB linguists have examined? More precisely, the goal of the experiment is to train a neural network from scratch, i.e. without the bifurcation into learned vs. innate components assumed by Chomsky, to produce the same judgments as native speakers on the sharply grammatical/ungrammatical pairs of the sort discussed above.

1.2 Representational Power

The most successful stochastic language models have been based on finite-state descriptions such as n-grams or hidden Markov models. However, finite-state models cannot represent the hierarchical structures found in natural language (Pereira 1992). In the past few years several recurrent neural network architectures have emerged which have been used for grammatical inference (Cleeremans, Servan-Schreiber & McClelland 1989, Giles, Sun, Chen, Lee & Chen 1990, Giles, Chen, Miller, Chen, Sun & Lee 1991, Giles, Miller, Chen, Chen, Sun & Lee 1992, Giles, Miller, Chen, Sun, Chen & Lee 1992). Do neural networks possess the power required for the task at hand? It has been shown that recurrent networks have the representational power required for hierarchical solutions (Elman 1991), and that they are Turing equivalent (Siegelmann & Sontag 1992). However, only recently has any work been successful with moderately large grammars. Recurrent neural networks have been used for several small natural language problems; papers using the Elman network for natural language tasks include, for example, (Stolcke 1990, Allen 1983, Elman 1984, Harris & Elman 1984, John & McClelland 1990).
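To make the setup concrete, the following is a minimal illustrative sketch, in Python with NumPy, of how an Elman-style simple recurrent network can map a sentence, encoded as a sequence of part-of-speech symbols, to a single grammatical/ungrammatical judgment. It is not the implementation used in the experiments reported here; the tag inventory, layer sizes, and function names are assumptions made for illustration only.

    # Sketch (not the reported implementation) of an Elman simple recurrent
    # network that reads a sentence as a sequence of part-of-speech symbols
    # and outputs a scalar grammaticality judgment.
    import numpy as np

    SYMBOLS = ["noun", "verb", "adj", "prep", "comp", "det"]  # hypothetical tag set
    INPUT_SIZE = len(SYMBOLS)
    HIDDEN_SIZE = 10          # illustrative size only
    rng = np.random.default_rng(0)

    # Weights: input-to-hidden, context(hidden)-to-hidden, hidden-to-output.
    W_xh = rng.normal(scale=0.1, size=(HIDDEN_SIZE, INPUT_SIZE))
    W_hh = rng.normal(scale=0.1, size=(HIDDEN_SIZE, HIDDEN_SIZE))
    W_hy = rng.normal(scale=0.1, size=(1, HIDDEN_SIZE))

    def one_hot(symbol):
        v = np.zeros(INPUT_SIZE)
        v[SYMBOLS.index(symbol)] = 1.0
        return v

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def grammaticality(tags):
        # The hidden state is copied back as the context input at each step,
        # so the judgment can depend on the whole sequence, not a fixed window.
        h = np.zeros(HIDDEN_SIZE)
        for tag in tags:
            h = sigmoid(W_xh @ one_hot(tag) + W_hh @ h)
        return sigmoid(W_hy @ h).item()   # near 1 = grammatical, near 0 = ungrammatical

    # Untrained example; training would adjust W_xh, W_hh and W_hy by gradient
    # descent (e.g. backpropagation through time) on labelled sentence pairs.
    print(grammaticality(["noun", "verb", "comp", "noun", "verb"]))

With randomly initialized weights the output is uninformative (close to 0.5); the point of the sketch is only the architecture: a fixed-size state carries information across arbitrarily long sequences, in contrast to the fixed temporal window used as input to the feed-forward models.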
References

[1] Anders Krogh, et al. Introduction to the Theory of Neural Computation. The Advanced Book Program, 1994.
[2] Philip J. Stone, et al. Experiments in Induction. 1966.
[3] Noam Chomsky, et al. Lectures on Government and Binding. 1981.
[4] Jorma Rissanen, et al. Stochastic Complexity in Statistical Inquiry. World Scientific Series in Computer Science, 1989.
[5] Alberto Maria Segre, et al. Programs for Machine Learning. 1994.
[6] James P. Crutchfield, et al. Computation at the Onset of Chaos. 1991.
[7] Esther Levin, et al. Accelerated Learning in Layered Neural Networks. Complex Systems, 1988.
[8] Raymond L. Watrous, et al. Induction of Finite-State Languages Using Second-Order Recurrent Networks. Neural Computation, 1992.
[9] Geoffrey E. Hinton. Learning and Applying Contextual Constraints in Sentence Comprehension. 1991.
[10] Jeffrey L. Elman, et al. Finding Structure in Time. Cognitive Science, 1990.
[11] Michael A. Arbib, et al. An Introduction to Formal Language Theory. Texts and Monographs in Computer Science, 1988.
[12] King-Sun Fu, et al. Syntactic Pattern Recognition and Applications. 1968.
[13] Jing Peng, et al. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories. Neural Computation, 1990.
[14] Hava T. Siegelmann, et al. On the Computational Power of Neural Nets. Journal of Computer and System Sciences, 1995.
[15] Ronald J. Williams, et al. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1989.
[16] James L. McClelland, et al. Learning and Applying Contextual Constraints in Sentence Comprehension. Artificial Intelligence, 1990.
[17] David Pesetsky, et al. Paths and Categories. 1982.
[18] Noam Chomsky. Knowledge of Language: Its Nature, Origin, and Use. 1988.
[19] Andreas Stolcke. Learning Feature-based Semantics with Simple Recurrent Networks. 1990.
[20] J. Ross Quinlan, et al. C4.5: Programs for Machine Learning. 1992.
[21] Patrice Y. Simard, et al. Analysis of Recurrent Backpropagation. 1988.
[22] S. Haykin, et al. Neural Networks: A Comprehensive Foundation. 1994.
[23] M. Inés Torres, et al. Pattern Recognition and Applications. 2000.
[24] Hava T. Siegelmann, et al. The Complexity of Language Recognition by Neural Networks. Neurocomputing, 1992.
[25] 金田 重郎, et al. C4.5: Programs for Machine Learning (book review). 1995.
[26] C. Lee Giles, et al. Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks. Neural Computation, 1992.
[27] Juan Uriagereka, et al. A Course in GB Syntax: Lectures on Binding and Empty Categories. 1988.
[28] Mary Hare, et al. The Role of Similarity in Hungarian Vowel Harmony: A Connectionist Account. 1990.
[29] James L. McClelland, David Rumelhart and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations; Vol. 2: Psychological and Biological Models. Cambridge, MA: MIT Press, 1987 (reviewed in Journal of Child Language, 1989).
[30] L. Ingber. Very Fast Simulated Re-annealing. 1989.
[31] C. Lee Giles, et al. Higher Order Recurrent Networks and Grammatical Inference. NIPS, 1989.
[32] David Sankoff, et al. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. 1983.
[33] L. Ingber. Adaptive Simulated Annealing (ASA). 1993.
[34] Kumpati S. Narendra, et al. Identification and Control of Dynamical Systems Using Neural Networks. IEEE Transactions on Neural Networks, 1990.
[35] Geoffrey E. Hinton, et al. Distributed Representations. The Philosophy of Artificial Intelligence, 1986.
[36] Padhraic Smyth, et al. Learning Finite State Machines with Self-Clustering Recurrent Networks. Neural Computation, 1993.
[37] Aravind K. Joshi, et al. Natural Language Parsing: Tree Adjoining Grammars: How Much Context-Sensitivity Is Required to Provide Reasonable Structural Descriptions? 1985.
[38] Giovanni Soda, et al. Local Feedback Multilayered Networks. Neural Computation, 1992.
[39] C. L. Giles, et al. Second-Order Recurrent Neural Networks for Grammatical Inference. IJCNN-91-Seattle International Joint Conference on Neural Networks, 1991.
[40] Eric B. Baum, et al. Supervised Learning of Probability Distributions by Neural Networks. NIPS, 1987.
[41] John E. Moody, et al. Note on Learning Rate Schedules for Stochastic Optimization. NIPS, 1990.
[42] J. J. Hopfield, et al. Learning Algorithms and Probability Distributions in Feed-Forward and Feed-Back Networks. Proceedings of the National Academy of Sciences of the United States of America, 1987.
[43] Garrison W. Cottrell, et al. A Connectionist Perspective on Prosodic Structure. 1989.
[44] Ah Chung Tsoi, et al. FIR and IIR Synapses, a New Neural Network Architecture for Time Series Modeling. Neural Computation, 1991.
[45] G. Kane. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations; Vol. 2: Psychological and Biological Models. 1994.
[46] James L. McClelland, et al. Finite State Automata and Simple Recurrent Networks. Neural Computation, 1989.
[47] Jeffrey L. Elman, et al. Distributed Representations, Simple Recurrent Networks, and Grammatical Structure. Machine Learning, 1991.
[48] C. Lee Giles, et al. An Experimental Comparison of Recurrent Neural Networks. NIPS, 1994.
[49] Noam Chomsky. Knowledge of Language. 1986.
[50] C. Lee Giles, et al. Extracting and Learning an Unknown Grammar with Recurrent Neural Networks. NIPS, 1991.
[51] John E. Moody, et al. Towards Faster Stochastic Gradient Search. NIPS, 1991.
[52] R. Taraban, et al. Language Learning: Cues or Rules? 1989.
[53] Ronald J. Williams, et al. Gradient-Based Learning Algorithms for Recurrent Connectionist Networks. 1990.
[54] J. Kruskal. An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. 1983.
[55] Fernando Pereira, et al. Inside-Outside Reestimation from Partially Bracketed Corpora. HLT, 1992.
[56] Noam Chomsky, et al. Three Models for the Description of Language. IRE Transactions on Information Theory, 1956.
[57] Solomon Kullback, et al. Information Theory and Statistics. 1960.
[58] David S. Touretzky. Rules and Maps in Connectionist Symbol Processing. 1989.
[59] Andrew D. Back. New Techniques for Nonlinear System Identification: A Rapprochement Between Neural Networks and Linear Systems. 1992.