Comparing Semantic Space Models Using Child-directed Speech

Brian Riordan (briordan@indiana.edu)
Department of Linguistics, 1021 E. Third St., Indiana University, Bloomington, IN 47405 USA

Michael N. Jones (jonesmn@indiana.edu)
Department of Psychological and Brain Sciences, 1101 E. Tenth St., Indiana University, Bloomington, IN 47405 USA

Abstract

A number of semantic space models from the cognitive science literature were compared by training on a corpus of child-directed speech and evaluating on three increasingly rigorous semantic tasks. The performance of families of models varied with the type of semantic data, and not all models were reasonably successful on each task, suggesting a narrowing of the space of plausible model architectures.

Keywords: semantic space models; child-directed speech; lexical development

Introduction

Semantic space models have proven successful at accounting for a broad range of semantic data, in particular semantic priming (Jones, Kintsch, & Mewhort, 2006; Lowe & McDonald, 2000). Since the models are successful at accounting for the semantic data in most cases, however, finding tasks where the models make different predictions, and thereby narrowing the space of plausible models, has proven difficult.

Semantic space models have traditionally been trained on adult language input. Further, the models are trained on very large corpora – in many cases, more data than humans experience. Finally, the models are usually applied to modeling semantic data only after processing the entire training corpus. Each of these steps is problematic.

The corpora that semantic space models have been trained on range from Usenet postings (Burgess, Livesay, & Lund, 1998; Rohde, Gonnerman, & Plaut, submitted; Shaoul & Westbury, 2006) to the British National Corpus (Bullinaria & Levy, in press; Lowe & McDonald, 2000) to the TASA corpus (Jones & Mewhort, 2007). These corpora vary widely in their content and in how representative they are of human experience. However, the rationale for using a particular corpus is rarely supported by an evaluation of its representativeness. For example, Burgess et al. (1998) motivate the use of Usenet by claiming that Usenet represents “everyday speech” and is “conversationally diverse” – without presenting an analysis of the corpus that would justify this claim.

The training corpora for semantic space models are not only diverse, but large. The BNC totals 100 million words, the Usenet corpora used for HAL and HiDEX approach 300 million words, while COALS is trained on more than 1.2 billion words. It has been estimated that at a rate of 150 words per minute (a high estimate), reading 8 hours per day for 365 days of the year, it would take more than four years to read the full 100 million words of the BNC. At this rate, it would take 12 years to encounter HAL’s 300 million words, and 48 years to encounter all of the words COALS is trained on (a sketch of this calculation appears at the end of this section). At the very least, it would seem that these models are trained on the very high end of the scale of possible human input.

For the most part, semantic space modelers have only assessed model predictions after the entire training corpus has been processed (the exceptions being LSA (Landauer & Dumais, 1997) and BEAGLE (Jones & Mewhort, 2007)). What is lacking is a consideration of the rate at which a model learns its representations – information that may be crucial for assessing model plausibility.

In order to remove these potential advantages, in this study we compare a variety of semantic space models from the cognitive science literature using age-stratified child-directed speech (CDS) from the CHILDES database. For several reasons, CDS may offer us the important ability to decide between equally plausible models that perform comparably at a larger learning scale. First, CDS is arguably much more realistic than the adult corpora that semantic space models have been trained on: we know that children learn the meanings of words with this kind of input. Second, since the size of any corpus derived from the CHILDES database will be much smaller than other training corpora, it is more likely to be in the range of input for a human learner. Third, the caregiver speech in the CHILDES database can be divided according to the age of the target child. This allows the construction of training corpora that reflect changes in input over time, similar to what children are actually exposed to.

Two previous studies have explored the behavior of semantic space models when trained on CDS. Li, Burgess, and Lund (2000) trained HAL on the caregiver speech in CHILDES, at the time 3.8 million words. Denhière and Lemaire (2004) derived an LSA space from a 3.2 million word French corpus that included both children’s speech and stories, textbooks, and encyclopedia articles written for children. However, it is not clear what is being modeled in these studies, as the training corpora aggregate a great deal of data from the linguistic environments of children of a variety of ages. The modeling target crucially affects the data on which the models should be evaluated.
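As a rough illustration of the scale argument above, the following is a minimal back-of-the-envelope sketch (in Python) of the reading-time estimates, assuming the stated rate of 150 words per minute for 8 hours a day, 365 days a year. The corpus sizes are the approximate totals cited above, and the outputs are rough estimates rather than exact figures.

```python
# Back-of-the-envelope estimate of how long a human reader would need
# to encounter the training corpora discussed above.
# Assumptions (from the text): 150 words per minute, 8 hours per day,
# 365 days per year; corpus sizes are approximate totals.

WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

words_per_year = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * DAYS_PER_YEAR

corpora = {
    "BNC (~100M words)": 100_000_000,
    "HAL / HiDEX Usenet (~300M words)": 300_000_000,
    "COALS (~1.2B words)": 1_200_000_000,
}

for name, size in corpora.items():
    years = size / words_per_year
    print(f"{name}: roughly {years:.1f} years of reading")
```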
References

[1] Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review.
[2] Shillcock, R., et al. (2001). Contextual distinctiveness: A new lexical property computed from large corpora.
[3] Lowe, W., & McDonald, S. (1998). Modelling functional priming and the associative boost.
[4] Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix.
[5] Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods.
[6] Fenson, L., et al. (1996). Lexical development norms for young children.
[7] Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods.
[8] Lapata, M., et al. (2003). Constructing semantic space models from parsed corpora. Proceedings of ACL.
[9] Rohde, D. L. T., Gonnerman, L. M., & Plaut, D. C. (2005). An improved model of semantic similarity based on lexical co-occurrence.
[10] Denhière, G., & Lemaire, B. (2004). A computational model of children's semantic memory.
[11] Howard, D., et al. (2001). Age of acquisition and imageability ratings for a large set of words, including verbs and function words. Behavior Research Methods, Instruments, & Computers.
[12] Schreiber, T. A., et al. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers.
[13] Jones, M. N., Kintsch, W., & Mewhort, D. J. K. (2006). High-dimensional semantic space accounts of priming.
[14] Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces.
[15] Maki, W. S., et al. (2004). Semantic distance norms computed from an electronic dictionary (WordNet). Behavior Research Methods, Instruments, & Computers.
[16] Li, P., Burgess, C., & Lund, K. (2000). The acquisition of word meaning through global lexical co-occurrences.
[17] Burgess, C., Livesay, K., & Lund, K. (1998). Explorations in context space: Words, sentences, discourse.
[18] Graesser, A. C., et al. (2005). Similarity between semantic spaces.
[19] Lowe, W., & McDonald, S. (2000). The direct route: Mediated priming in semantic space.
[20] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review.