One-shot learning of generative speech concepts

Brenden M. Lake* (Brain and Cognitive Sciences, MIT), Chia-ying Lee* (CSAIL, MIT), James R. Glass (CSAIL, MIT), Joshua B. Tenenbaum (Brain and Cognitive Sciences, MIT)

* The first two authors contributed equally to this work.

Abstract

One-shot learning – the human ability to learn a new concept from just one or a few examples – poses a challenge to traditional learning algorithms, although approaches based on Hierarchical Bayesian models and compositional representations have been making headway. This paper investigates how children and adults readily learn the spoken form of new words from one example – recognizing arbitrary instances of a novel phonological sequence, and excluding non-instances, regardless of speaker identity and acoustic variability. This is an essential step on the way to learning a word's meaning and learning to use it, and we develop a Hierarchical Bayesian acoustic model that can learn spoken words from one example, utilizing compositions of phoneme-like units that are the product of unsupervised learning. We compare people and computational models on one-shot classification and generation tasks with novel Japanese words, finding that the learned units play an important role in achieving good performance.

Keywords: one-shot learning; speech recognition; category learning; exemplar generation

Introduction

People can learn a new concept from just one or a few examples, making meaningful generalizations that go far beyond the observed data. Replicating this ability in machines has been challenging, since standard learning algorithms require tens, hundreds, or thousands of examples before reaching a high level of classification performance. Nonetheless, recent interest from cognitive science and machine learning has advanced our computational understanding of "one-shot learning," and several key themes have emerged. Probabilistic generative models can predict how people generalize from just one or a few examples, as shown for data lying in a low-dimensional space (Shepard, 1987; Tenenbaum & Griffiths, 2001). Another theme has developed around learning-to-learn, the idea that one-shot learning itself develops from previous learning with related concepts, and Hierarchical Bayesian (HB) models can learn-to-learn by highlighting the dimensions or features that are most important for generalization (Fei-Fei, Fergus, & Perona, 2006; Kemp, Perfors, & Tenenbaum, 2007; Salakhutdinov, Tenenbaum, & Torralba, 2012).

In this paper, we study the problem of learning new spoken words, an essential ingredient for language development. By one estimate, children learn an average of ten new words per day from the age of one to the end of high school (Bloom, 2000). For learning to proceed at such an astounding rate, children must be learning new words from very little data.

Previous computational work has focused on the problem of learning the meaning of words from a few examples; for instance, upon hearing the word "elephant" paired with an exemplar, the child must decide which objects belong to the set of "elephants" and which do not (e.g., Xu & Tenenbaum, 2007). Related computational work has investigated other factors that contribute to learning word meaning, including learning-to-learn which features are important (Colunga & Smith, 2005; Kemp et al., 2007) and cross-situational word learning (Smith & Yu, 2008; Frank, Goodman, & Tenenbaum, 2009). But by any account, the acquisition of meaning is only possible because the child can also learn the spoken word as a category, mapping all instances (and excluding non-instances) of a word like "elephant" to the same phonological representation, regardless of speaker identity and other sources of acoustic variability. This is the focus of the current paper. Previous work has shown that children can do one-shot spoken word learning (Carey & Bartlett, 1978). When children (ages 3-4) were asked to bring over a "chromium" colored object, they seemed to flag the sound as a new word; some even later produced their own approximation of the word "chromium." Furthermore, acquiring new spoken words remains an important problem well into adulthood, whether it is learning a second language, a new name, or a new vocabulary word.

The goal of our work is twofold: to develop one-shot learning tasks that can compare people and models side-by-side, and to develop a computational model that performs well on these tasks. Since the tasks must contain novel words for both people and algorithms, we tested English speakers on their ability to learn Japanese words. This language pairing also offers an interesting test case for learning-to-learn through the transfer of phonetic structure, since the Japanese analogs to English phonemes fall roughly within a subset of English phonemes (Ohata, 2004).

Can the recent progress on models of one-shot learning be leveraged for learning new spoken words from raw speech? How could a generative model of a word be learned from just one example? Recent behavioral and computational work suggests that compositionality, combined with Hierarchical Bayesian modeling, can be a powerful way to build a "generative model for generative models" that supports one-shot learning (Lake, Salakhutdinov, & Tenenbaum, 2012; Lake et al., 2013). This idea was applied to the one-shot learning of handwritten characters, a similarly high-dimensional domain of natural concepts, using an "analysis-by-synthesis" approach. Given a raw image of a novel character, the model learns to represent it by a latent dynamic causal process, composed of pen strokes and their spatial relations (Fig. 1a). The sharing of stochastic motor primitives across concepts (Fig. 1a-i) provides a means of synthesizing new generative models out of pieces of existing ones (Fig. 1a-iii). Compositional generative models are well-suited for the problem of spoken word acquisition, as they relate to classic
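The one-shot classification task described above can be illustrated with a toy sketch. This is not the paper's Hierarchical Bayesian acoustic model; it only assumes that each spoken word has already been transcribed into a sequence of discrete, phoneme-like unit labels (the unit IDs and word names below are hypothetical), and it classifies a new token by its dynamic-time-warping distance to each word's single stored exemplar, in the spirit of the DTW comparison of Sakoe & Chiba (1978) cited in the references.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two unit-label sequences
    (0/1 substitution cost; stretches absorbed by the warping moves)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            D[i][j] = cost + min(D[i - 1][j],      # repeat a unit of b
                                 D[i][j - 1],      # repeat a unit of a
                                 D[i - 1][j - 1])  # align a[i] with b[j]
    return D[n][m]

def one_shot_classify(test, exemplars):
    """Pick the word whose single stored exemplar is nearest to `test`."""
    return min(exemplars, key=lambda w: dtw_distance(test, exemplars[w]))

# One example per novel word, encoded as hypothetical unit-label sequences.
exemplars = {"word_A": [3, 7, 2], "word_B": [5, 1, 5, 9]}
print(one_shot_classify([3, 7, 7, 2], exemplars))  # prints word_A
```

A stretched token like `[3, 7, 7, 2]` warps onto `[3, 7, 2]` at zero cost, loosely mirroring how a phonological representation should tolerate duration and speaker variability while still excluding non-instances.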

[1]  Matthew J. Johnson,et al.  Bayesian nonparametric hidden semi-Markov models , 2012, J. Mach. Learn. Res..

[2]  J. Tenenbaum,et al.  Generalization, similarity, and Bayesian inference. , 2001, The Behavioral and brain sciences.

[3]  Shuichi Itahashi,et al.  JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research , 1999 .

[4]  Linda B. Smith,et al.  Infants rapidly learn word-referent mappings via cross-situational statistics , 2008, Cognition.

[5]  Kenneth N. Stevens,et al.  Speech recognition: A model and a program for research , 1962, IRE Trans. Inf. Theory.

[6]  Sharon Goldwater,et al.  A role for the developing lexicon in phonetic category acquisition. , 2013, Psychological review.

[7]  Susan Carey,et al.  Acquiring a Single New Word , 1978 .

[8]  Linda B. Smith,et al.  From the lexicon to expectations about kinds: a role for associative learning. , 2005, Psychological review.

[9]  F. Jelinek,et al.  Continuous speech recognition by statistical methods , 1976, Proceedings of the IEEE.

[10]  Heiga Zen,et al.  Speech Synthesis Based on Hidden Markov Models , 2013, Proceedings of the IEEE.

[11]  K. Ohata.  Phonological differences between Japanese and English: several potentially problematic areas of pronunciation for Japanese ESL/EFL learners , 2008 .

[12]  Joshua B. Tenenbaum,et al.  One-shot learning by inverting a compositional causal process , 2013, NIPS.

[13]  Keiichi Tokuda,et al.  Duration modeling for HMM-based speech synthesis , 1998, ICSLP.

[14]  Joshua B. Tenenbaum,et al.  One-Shot Learning with a Hierarchical Nonparametric Bayesian Model , 2011, ICML Unsupervised and Transfer Learning.

[15]  A M Liberman,et al.  Perception of the speech code. , 1967, Psychological review.

[16]  Todd M. Gureckis,et al.  Evaluating Amazon's Mechanical Turk as a Tool for Experimental Behavioral Research , 2013, PloS one.

[17]  Joshua B. Tenenbaum,et al.  Concept learning as motor program induction: A large-scale empirical study , 2012, CogSci.

[18]  J. Tenenbaum,et al.  Word learning as Bayesian inference. , 2007, Psychological review.

[19]  David Poeppel,et al.  Analysis by Synthesis: A (Re-)Emerging Program of Research for Language and Vision , 2010, Biolinguistics.

[20]  Pietro Perona,et al.  One-shot learning of object categories , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  S. Chiba,et al.  Dynamic programming algorithm optimization for spoken word recognition , 1978 .

[22]  Yoram Singer,et al.  The Hierarchical Hidden Markov Model: Analysis and Applications , 1998, Machine Learning.

[24]  R. Shepard,et al.  Toward a universal law of generalization for psychological science. , 1987, Science.

[25]  James L. McClelland,et al.  Unsupervised learning of vowel categories from infant-directed speech , 2007, Proceedings of the National Academy of Sciences.

[26]  D M Ennis,et al.  Toward a universal law of generalization. , 1988, Science.

[27]  James R. Glass,et al.  A Nonparametric Bayesian Approach to Acoustic Model Discovery , 2012, ACL.

[28]  Biing-Hwang Juang,et al.  Hidden Markov Models for Speech Recognition , 1991 .

[29]  P. Bloom How children learn the meanings of words , 2000 .

[30]  Michael C. Frank,et al.  Using Speakers' Referential Intentions to Model Early Cross-Situational Word Learning , 2009, Psychological Science.

[31]  S. Eddy Hidden Markov models. , 1996, Current opinion in structural biology.

[32]  J. Tenenbaum,et al.  Learning Overhypotheses with Hierarchical Bayesian Models , 2007, Developmental Science.