Learning Bias and Phonological-Rule Induction

A fundamental debate in the machine learning of language concerns the role of prior knowledge in the learning process. Purely nativist approaches, such as the Principles and Parameters model, build parameterized linguistic generalizations directly into the learning system. Purely empiricist approaches use a general, domain-independent learning rule (such as error back-propagation, instance-based generalization, or minimum description length) to learn linguistic generalizations directly from the data. In this paper we suggest that an alternative to the purely nativist and purely empiricist learning paradigms is to represent the prior knowledge of language as a set of abstract learning biases that guide an empirical inductive learning algorithm.

We test this idea by examining the machine learning of simple Sound Pattern of English (SPE)-style phonological rules. We represent phonological rules as finite-state transducers that accept underlying forms as input and generate surface forms as output. We show that OSTIA, a general-purpose transducer induction algorithm, is incapable of learning simple phonological rules like flapping. We then augment OSTIA with three kinds of learning biases that are specific to natural language phonology and that are assumed, explicitly or implicitly, by every theory of phonology: faithfulness (underlying segments tend to be realized similarly on the surface), community (similar segments behave similarly), and context (phonological rules need access to variables in their context). These biases are so fundamental to generative phonology that they are left implicit in many theories, but explicitly modifying the OSTIA algorithm with them allows it to learn more compact, accurate, and general transducers, and our implementation successfully learns a number of rules from English and German. Furthermore, we show that some of the remaining errors in our augmented model are due to implicit biases in the traditional SPE-style rewrite system that are not similarly represented in the transducer formalism, suggesting that while transducers may be formally equivalent to SPE-style rules, they may not have identical evaluation procedures.

Because our biases were applied to the learning of very simple SPE-style rules, and to a nonprobabilistic, not psychologically motivated theory of purely deterministic transducers, we do not expect that our model as implemented has any practical use as a phonological learning device, nor is it intended as a cognitive model of human learning. Indeed, because of the noise and nondeterminism inherent in linguistic data, we feel strongly that stochastic algorithms for language induction are much more likely to be a fruitful research direction. Our model is instead intended to suggest the kinds of biases that may be added to other empiricist induction models, and the way in which they may be added, in order to build a cognitively and computationally plausible learning model for phonological rules.
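To make the transducer formalism concrete, the sketch below hand-codes a flapping rule (roughly, underlying /t/ surfaces as a flap between a stressed and an unstressed vowel) as a deterministic finite-state transducer of the kind the paper asks OSTIA to induce. The segment notation, the simplified rule environment, and the function name flap are illustrative assumptions rather than the paper's implementation; in the paper the machine is learned from underlying/surface pairs, not written by hand.

```python
# Minimal sketch of SPE-style flapping (t -> flap / stressed V __ unstressed V)
# as a deterministic transducer mapping underlying forms to surface forms.
# The inventory, the stress notation (V' = stressed vowel), and "dx" for the
# flap (as in ARPAbet) are assumptions for illustration only.

STRESSED = {"a'", "e'", "i'", "o'", "u'"}   # stressed vowels (assumed notation)
UNSTRESSED = {"a", "e", "i", "o", "u"}      # unstressed vowels

def flap(underlying):
    """Map a list of underlying segments to a list of surface segments.

    States:
      0 -- no stressed vowel pending
      1 -- just saw a stressed vowel
      2 -- saw a stressed vowel then /t/; output is delayed until we
           know whether an unstressed vowel follows
    """
    state, out = 0, []
    for seg in underlying:
        if state == 0:
            out.append(seg)
            state = 1 if seg in STRESSED else 0
        elif state == 1:
            if seg == "t":
                state = 2                    # delay output for the /t/
            else:
                out.append(seg)
                state = 1 if seg in STRESSED else 0
        else:  # state == 2: a /t/ after a stressed vowel is pending
            if seg in UNSTRESSED:
                out.extend(["dx", seg])      # flapping environment met
                state = 0
            else:
                out.extend(["t", seg])       # environment not met: plain /t/
                state = 1 if seg in STRESSED else 0
    if state == 2:
        out.append("t")                      # word-final /t/ never flaps here
    return out

# "butter" with initial stress: /b u' t e r/ -> [b u' dx e r]
print(flap(["b", "u'", "t", "e", "r"]))
# word-final /t/ is unchanged: /k a' t/ -> [k a' t]
print(flap(["k", "a'", "t"]))
```

The delayed output in state 2 hints at why such rules are hard to induce: because flapping depends on right context, the machine cannot emit anything for /t/ until it has seen the following segment, and a learner like OSTIA must discover that delay from the data alone.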
