Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders

In this paper, we deploy binary stochastic neural autoencoder networks as models of infant language learning in two typologically unrelated languages (Xitsonga and English). We show that the drive to model auditory percepts leads to latent clusters that partially align with theory-driven phonemic categories. We further evaluate the degree to which theory-driven phonological features are encoded in the latent bit patterns, finding that some (e.g. [±approximant]) are well represented by the network in both languages, while others (e.g. [±spread glottis]) are less so. Together, these findings suggest that many reliable cues to phonemic structure are immediately available to infants from bottom-up perceptual characteristics alone, but that these cues must eventually be supplemented by top-down lexical and phonotactic information to achieve adult-like phone discrimination. Our results also suggest differences in degree of perceptual availability between features, yielding testable predictions as to which features might depend more or less heavily on top-down cues during child language acquisition.
