Deriving Spectro-temporal Properties of Hearing from Speech Data

Human hearing and human speech are intrinsically linked: the properties of speech almost certainly evolved to be heard by human ears. Because of this connection, data-driven systems trained to understand human speech have been shown to mimic certain properties of human hearing. In this paper, we explore this phenomenon further by measuring the spectro-temporal responses of the data-derived filters in the front-end convolutional layer of a deep network trained to classify the phonemes of clean speech. The analyses show that the filters do indeed exhibit spectro-temporal responses similar to those measured in mammals, and that they exhibit an additional level of frequency selectivity, similar to the processing pipeline assumed by the Articulation Index.
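As a minimal sketch of the kind of analysis described above, the frequency selectivity of learned front-end filters can be probed by taking the magnitude spectrum of each filter's impulse response (its kernel weights) and locating the peak. The function below is illustrative only: the weight layout `(n_filters, kernel_len)`, the 16 kHz sample rate, and the function name are assumptions, not details taken from the paper.

```python
import numpy as np

def filter_center_frequencies(weights, fs=16000, n_fft=512):
    """Estimate the center frequency (Hz) of each learned conv filter.

    weights: array of shape (n_filters, kernel_len) -- a hypothetical
    layout; the network in the paper may store its weights differently.
    """
    # Zero-padded magnitude spectrum of each filter's impulse response
    mags = np.abs(np.fft.rfft(weights, n=n_fft, axis=1))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    # The bin with the largest magnitude is taken as the center frequency
    return freqs[np.argmax(mags, axis=1)]

# Toy check: a filter that is a pure 1 kHz tone should peak near 1 kHz
t = np.arange(400) / 16000.0
w = np.sin(2 * np.pi * 1000.0 * t)[None, :]
centers = filter_center_frequencies(w)
```

Sorting filters by these estimated center frequencies is one common way to visualize whether a learned filter bank approximates an auditory (e.g. mel-like) frequency warping.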
