Complex Linear Projection (CLP): A Discriminative Approach to Joint Feature Extraction and Acoustic Modeling

State-of-the-art automatic speech recognition (ASR) systems typically rely on pre-processed features. This paper studies the time-frequency duality in ASR feature extraction methods and proposes extending the standard acoustic model with a complex-valued linear projection layer that learns and optimizes features directly by minimizing standard cost functions such as cross-entropy. The proposed Complex Linear Projection (CLP) features outperform pre-processed log Mel features.
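To make the idea concrete, the following is a minimal sketch (not the authors' exact implementation) of how a CLP front-end could compute features: a learnable complex-valued matrix is applied to the FFT of each windowed waveform frame, and real-valued features are obtained by log-magnitude compression. The filter count, frame size, and the split of the projection into real and imaginary parts W_re and W_im are illustrative assumptions; in a full system these parameters would be trained jointly with the acoustic model.

```python
import numpy as np

def clp_features(frames, W_re, W_im, eps=1e-6):
    """Illustrative CLP forward pass (sketch, hypothetical parameterization).

    frames : (batch, frame_len) windowed raw-waveform frames
    W_re, W_im : (n_filters, n_bins) real/imaginary parts of the learnable
                 complex projection matrix, assumed to be optimized jointly
                 with the acoustic model (e.g. under a cross-entropy loss).
    """
    # Complex one-sided spectrum of each frame.
    X = np.fft.rfft(frames, axis=-1)        # (batch, n_bins), complex
    W = W_re + 1j * W_im                    # (n_filters, n_bins), complex
    # Project the complex spectrum with the complex filter bank.
    Y = X @ W.T                             # (batch, n_filters), complex
    # Magnitude plus log compression yields real-valued features.
    return np.log(np.abs(Y) + eps)

# Example: 25 ms frames at 16 kHz -> 400 samples, 201 FFT bins, 40 filters.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 400))
W_re = rng.standard_normal((40, 201)) * 0.01
W_im = rng.standard_normal((40, 201)) * 0.01
feats = clp_features(frames, W_re, W_im)    # shape (8, 40)
```

In such a setup the output of clp_features would feed the acoustic model in place of log Mel features, and gradients of the recognition loss would flow back into W_re and W_im so the filter bank itself is learned discriminatively.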
