Auditory Cortical Representations of Speech Signals for Phoneme Classification

The use of biologically inspired feature extraction methods has improved the performance of artificial systems that attempt to emulate aspects of human communication. Recent techniques, such as independent component analysis and sparse representations, have made it possible to analyze speech signals using features similar to those found experimentally at the level of the primary auditory cortex. In this work, a new type of speech signal representation based on spectro-temporal receptive fields is presented, and a phoneme classification problem is tackled for the first time using this representation. The results are compared with, and found to substantially improve on, both an early auditory representation and the classical front end based on Mel-frequency cepstral coefficients.
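
The abstract contrasts the proposed spectro-temporal representation with the classical MFCC front end. The sketch below is not the paper's actual pipeline; it only illustrates, under stated assumptions, how such a comparison might be set up in Python. The library choices (librosa, scikit-learn), the modulation-spectrum stand-in for STRF-like features, and all parameter values are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's method): compare the classical
# MFCC front end with a crude spectro-temporal feature for phoneme
# classification over pre-segmented phoneme waveforms.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def mfcc_features(y, sr, n_mfcc=13):
    """Classical front end: mean MFCC vector over a phoneme segment."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)


def spectrotemporal_features(y, sr, n_mels=32, n_frames=64):
    """Crude stand-in for an STRF-like code: the 2-D modulation spectrum of
    the log-mel spectrogram captures joint spectral and temporal modulations.
    """
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logS = np.log(S + 1e-8)
    # Pad or truncate the time axis so every segment yields a fixed-size feature.
    if logS.shape[1] < n_frames:
        logS = np.pad(logS, ((0, 0), (0, n_frames - logS.shape[1])))
    else:
        logS = logS[:, :n_frames]
    mod = np.abs(np.fft.rfft2(logS))   # spectro-temporal modulation magnitudes
    return mod[:8, :8].ravel()         # keep only low modulation frequencies


def train_classifier(segments, sr, featurizer):
    """`segments` is a hypothetical list of (waveform, phoneme_label) pairs."""
    X = np.array([featurizer(y, sr) for y, _ in segments])
    labels = [label for _, label in segments]
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X, labels)
    return clf
```

Swapping `mfcc_features` for `spectrotemporal_features` in `train_classifier` reproduces, in spirit only, the kind of front-end comparison the abstract describes; the paper's actual representation is derived from spectro-temporal receptive fields rather than a modulation spectrum.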
