Evaluating Models of Syntactic Category Acquisition without Using a Gold Standard

Stella Frank (s.c.frank@sms.ed.ac.uk), Sharon Goldwater (sgwater@inf.ed.ac.uk), and Frank Keller (keller@inf.ed.ac.uk)
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, UK

Abstract

A number of different measures have been proposed for evaluating computational models of human syntactic category acquisition. They all rely on a gold standard set of manually determined categories. However, children’s syntactic categories change during language development, so evaluating against a fixed and final set of adult categories is not appropriate. In this paper, we propose a new measure, substitutable precision and recall, based on the idea that words which occur in similar syntactic environments share the same category. We use this measure to evaluate three standard category acquisition models (hierarchical clustering, frequent frames, Bayesian HMM) and show that the results correlate well with those obtained using two gold-standard-based measures.

Introduction

By the time children reach school age, they have achieved the remarkable feat of acquiring most of their native language, typically without explicit instruction. This includes the acquisition of syntactic categories (noun, verb, adjective, etc.). A number of computational models of category learning have been developed, most of which conceptualize the problem as one of grouping together words whose syntactic behavior is similar. Typically, the input for the model is taken from a corpus of child-directed speech, and clusters are created based on distributional information (Redington et al., 1998; Mintz, 2003; Parisien et al., 2008).

A problem common to all existing models is the evaluation of the model clusters. Often researchers have tested the output of their models against gold-standard category assignments, such as that available in the CHILDES database (MacWhinney, 2000).
These gold-standard categories are based on the intuition of human annotators and are representative of adult morphosyntactic knowledge. Therefore, this type of evaluation is not ideal for assessing the syntactic categories of children, as these may include linguistically valid distinctions not recognized by the gold standard. Conversely, the gold standard may make distinctions that children do not have, or only acquire during language development. For example, at the age of two, English-learning children have not fully acquired the verb category (Olguin & Tomasello, 1993), and functional categories such as determiners are acquired even later (Kemp et al., 2005).

It is therefore highly desirable to develop an evaluation measure that does not make reference to an (adult) gold standard. On the other hand, the measure should give results that correlate with gold-standard-based measures, indicating that it is capable of capturing the linguistic distinctions inherent in the gold-standard. Finally, the ideal measure needs to be applicable to a wide range of different acquisition models (e.g., it should not be limited to probabilistic models).

This paper proposes a new evaluation measure which meets these criteria: substitutable precision and recall. It relies on a classical idea from linguistics, viz., that words which share the same syntactic category occur in similar syntactic environments. It does not require a gold standard, and therefore is suitable for evaluating pre-adult categories. At the same time, it yields results that correlate with gold-standard-based measures. We will show this by applying our new measure, as well as existing measures, to three standard models that discover syntactic categories in child-directed speech. This is the first time these models have been systematically compared; previous authors have used their own evaluation measures and only applied them to their own data sets, thus making a comparison across models difficult.
Gold-standard-based Evaluation Measures

In the following section we describe two evaluation measures that have been used to evaluate category acquisition models. Both require gold-standard labeled data, which is problematic from an acquisition standpoint for the reasons previously discussed. Hand-labeled data is also scarce, particularly for languages other than English.

Some of the models we investigate categorize word types (a type being a word such as duck), whereas others categorize tokens (particular instances of duck). In order to compare both kinds of models, the measures we describe are used to score tokens, not types.

Matched Accuracy

This measure is widely used in the field of Natural Language Processing for unsupervised part-of-speech tagging, in which the tokens of a text are automatically annotated (“tagged”) with cluster numbers. To obtain the matched accuracy MA, the clusters induced by the model are mapped onto the gold-standard categories in order to provide a gold-standard part-of-speech label for each cluster. MA is then defined as the percentage of word tokens with correct category labels.

The crucial aspect is the mapping between the clusters and the gold standard categories. In this paper, we use many-to-one accuracy, where each model cluster is matched onto the gold-standard category with which it shares the most tokens. This can result in a situation where multiple clusters are mapped onto the same gold standard category. This means the model is not penalized for creating more fine-grained clusters than the gold standard.
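The many-to-one mapping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `many_to_one_accuracy`, the toy cluster ids, and the toy gold tags are all hypothetical, and ties between gold categories are broken arbitrarily (by first occurrence), which the text does not specify.

```python
from collections import Counter, defaultdict


def many_to_one_accuracy(cluster_ids, gold_tags):
    """Return (MA as a percentage, cluster -> gold category mapping).

    Both arguments are token-level sequences of equal length: one cluster
    id and one gold-standard tag per word token.
    """
    if len(cluster_ids) != len(gold_tags):
        raise ValueError("need one cluster id and one gold tag per token")
    if not cluster_ids:
        raise ValueError("cannot score an empty token sequence")

    # Count how many tokens each cluster shares with each gold category.
    overlap = defaultdict(Counter)
    for cluster, tag in zip(cluster_ids, gold_tags):
        overlap[cluster][tag] += 1

    # Map each cluster onto the gold category it shares the most tokens with.
    # Several clusters may end up mapped onto the same category.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in overlap.items()}

    # MA is the percentage of tokens whose mapped label matches the gold tag.
    correct = sum(mapping[c] == t for c, t in zip(cluster_ids, gold_tags))
    return 100.0 * correct / len(gold_tags), mapping


# Hypothetical toy data: eight tokens, three induced clusters.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
gold = ["N", "N", "V", "N", "N", "V", "V", "D"]
ma, mapping = many_to_one_accuracy(clusters, gold)
print(ma, mapping)
```

In this toy example clusters 0 and 1 are both mapped onto the same gold category, so splitting one gold category across two clusters costs the model nothing, which is the behavior noted in the text.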
References

[1] Afsaneh Fazly, et al. An Incremental Bayesian Model for Learning Syntactic Categories. CoNLL, 2008.
[2] B. MacWhinney. The CHILDES project: tools for analyzing talk. 1992.
[3] Toben H. Mintz. Frequent frames as a cue for grammatical categories in child directed speech. Cognition, 2003.
[4] Anna L. Theakston, et al. The role of performance limitations in the acquisition of verb-argument structure: an alternative account. Journal of Child Language, 2001.
[5] Toben H. Mintz. Category induction from distributional cues in an artificial language. Memory & Cognition, 2002.
[6] Hinrich Schütze, et al. Distributional Part-of-Speech Tagging. EACL, 1995.
[7] E. Markman, et al. Rapid Word Learning in 13- and 18-Month-Olds. 1994.
[8] Z. Harris. From Morpheme to Utterance. 1946.
[9] M. Tomasello, et al. Twenty-Five-Month-Old Children Do Not Have a Grammatical Category of Verb. 1993.
[10] R. Brown, et al. The Acquisition of Syntax. Monographs of the Society for Research in Child Development, 1964.
[11] Thomas L. Griffiths, et al. A fully Bayesian approach to unsupervised part-of-speech tagging. ACL, 2007.
[12] Alexander Clark, et al. Inducing Syntactic Categories by Context Distribution Clustering. CoNLL/LLL, 2000.
[13] M. Tomasello, et al. Young children's knowledge of the "determiner" and "adjective" categories. Journal of Speech, Language, and Hearing Research, 2005.
[14] R. Gómez, et al. The Developmental Trajectory of Nonadjacent Dependency Learning. Infancy, 2005.
[15] Nick Chater, et al. Distributional Information: A Powerful Cue for Acquiring Syntactic Categories. Cogn. Sci., 1998.