Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions

This paper introduces a computational model that automatically segments acoustic speech data and builds internal representations of keyword classes from cross-modal (acoustic and pseudo-visual) input. Acoustic segmentation is achieved using a novel dynamic time warping technique, and the focus of this paper is on recent investigations conducted to enhance the identification of repeating portions of speech. This ongoing research is inspired by current cognitive views of early language acquisition and therefore strives for ecological plausibility in an attempt to build more robust speech recognition systems. Results show that an ad-hoc computationally engineered solution can aid the discovery of repeating acoustic patterns. However, we show that this improvement can be simulated in a more ecologically valid way.
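For readers unfamiliar with dynamic time warping, the following is a minimal sketch of the classic textbook algorithm, not the novel variant used in this work. It assumes each utterance has been reduced to a one-dimensional sequence of feature values and uses absolute difference as the local distance; the function name `dtw_distance` and both of these choices are illustrative assumptions.

```python
from math import inf


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment of sequences a and b.

    Textbook DTW over 1-D features (an assumption for illustration);
    this is not the novel technique described in the paper.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Extend the cheapest of the three permitted predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A time-stretched repetition aligns at zero cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A shifted pattern incurs a non-zero cost.
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because the alignment may repeat frames, two renditions of the same pattern spoken at different rates receive a low cost, which is the property that makes this family of methods suitable for finding repeating portions of speech.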
