Data-driven speech representations for NMF-based word learning

State-of-the-art solutions in ASR often rely on large amounts of expert prior knowledge, which is undesirable in some applications. In this paper, we consider an NMF-based framework that learns a small vocabulary of words directly from input data, without prior knowledge such as phone sets and dictionaries. In the context of this learning scheme, we compare several spectral representations of speech. Where necessary, we propose changes to their derivation to avoid the use of prior linguistic knowledge. In addition, through a comparison of several acoustic modelling techniques, we determine which model properties are beneficial to the framework's performance.
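The factorization underlying this framework can be sketched with the classic multiplicative NMF updates of Lee and Seung. The sketch below is illustrative only: the toy data, matrix sizes, and rank are assumptions, not the paper's actual setup, where the data matrix would instead hold non-negative acoustic co-occurrence statistics per utterance.

```python
import numpy as np

rng = np.random.default_rng(0)

# V: non-negative data matrix; in the word-learning setting, rows would be
# acoustic co-occurrence counts and columns utterances (toy data here).
V = rng.random((50, 20))
r = 5                              # number of word-like parts to discover
W = rng.random((50, r)) + 1e-3     # dictionary of recurring acoustic patterns
H = rng.random((r, 20)) + 1e-3     # activations of those patterns per utterance

eps = 1e-9

def kl_div(V, WH):
    """Generalized KL divergence D(V || WH), the cost these updates minimize."""
    return float((V * np.log((V + eps) / (WH + eps)) - V + WH).sum())

kl_before = kl_div(V, W @ H)
for _ in range(200):
    # Multiplicative updates: keep W and H non-negative and never
    # increase the divergence D(V || WH).
    H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0, keepdims=True).T + eps)
    W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
kl_after = kl_div(V, W @ H)
```

After convergence, each column of W is a non-negative "part"; applied to suitable speech representations, such parts tend to correspond to recurring word-like patterns, with H indicating which words are active in each utterance.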
