Combining NDHMM and phonetic feature detection for speech recognition

The non-negative HMM (N-HMM) [1] is well suited to modeling mixtures of signals such as audio, but it cannot generalize to unseen data. The non-negative durational HMM (NdHMM) was recently proposed [2] as a modification of N-HMM that allows such generalization and thus makes the approach suitable for automatic speech recognition. Separately, several researchers have studied a detector-based approach to speech recognition as an alternative to the traditional HMM approach. A bank of phonetic feature detectors produces phonetic feature posteriors, which are non-negative by construction and therefore fit the non-negativity constraint of NdHMM. We review the NdHMM approach proposed in [2] and extend it by combining NdHMM with a phonetic feature detection front-end in a tandem-like system. Experimental results for the proposed approach are presented.
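The front-end idea above can be illustrated with a minimal sketch. It is not the detector bank used in the paper, whose design is not described here. It assumes one hypothetical logistic detector per phonetic feature, with random weights standing in for trained ones, and only shows that the stacked posteriors form a non-negative frames-by-features matrix of the kind a non-negative model can take as input. All names (`detector_bank`, `frames`, `weights`, `biases`) and the sizes `T`, `D`, `K` are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    """Numerically stable logistic function; output lies in [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def detector_bank(frames, weights, biases):
    """Apply one hypothetical logistic detector per phonetic feature.

    frames:  list of T acoustic feature vectors (each of length D)
    weights: list of K weight vectors (each of length D), one per detector
    biases:  list of K scalars, one per detector
    Returns a T x K nested list of posteriors, one row per frame.
    """
    posteriors = []
    for frame in frames:
        row = []
        for w, b in zip(weights, biases):
            activation = sum(wi * xi for wi, xi in zip(w, frame)) + b
            row.append(sigmoid(activation))
        posteriors.append(row)
    return posteriors

# Assumed toy sizes: 5 frames, 13-dimensional features, 4 phonetic features.
random.seed(0)
T, D, K = 5, 13, 4
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
# Random weights stand in for trained detectors (an assumption of this sketch).
weights = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
biases = [0.0] * K

# V is the non-negative observation matrix passed on to the back-end model.
V = detector_bank(frames, weights, biases)
```

Each detector is independent, so a row of `V` need not sum to one. Only non-negativity is guaranteed, which is the property the NdHMM constraint requires.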

[1] Torbjørn Svendsen, et al. Non-negative durational HMM, 2013, 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP).

[2] Jonathan G. Fiscus, et al. DARPA TIMIT: acoustic-phonetic continuous speech corpus CD-ROM, NIST speech disc 1-1.1, 1993.

[3] Daniel P. W. Ellis, et al. Tandem connectionist feature extraction for conventional HMM systems, 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[4] B. Raj, et al. Latent variable decomposition of spectrograms for single channel speaker separation, 2005, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005.

[5] Bhiksha Raj, et al. Non-negative Hidden Markov Modeling of Audio with Application to Source Separation, 2010, LVA/ICA.

[6] Sabato Marco Siniscalchi, et al. Combining speech attribute detection and penalized logistic regression for phoneme recognition, 2012, Neurocomputing.

[7] Jarle Bauck Hamar. Using Sub-Phonemic Units for HMM Based Phone Recognition, 2013.

[8] Thomas Hofmann, et al. Unsupervised Learning by Probabilistic Latent Semantic Analysis, 2004, Machine Learning.

[9] Chin-Hui Lee, et al. An Information-Extraction Approach to Speech Processing: Analysis, Detection, Verification, and Recognition, 2013, Proceedings of the IEEE.

[10] Dau-Cheng Lyu, et al. Experiments on Cross-Language Attribute Detection and Phone Recognition With Minimal Target-Specific Training Data, 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[11] Julius O. Smith, et al. A non-negative framework for joint modeling of spectral structure and temporal dynamics in sound mixtures, 2010.

[12] Hsiao-Wuen Hon, et al. Speaker-independent phone recognition using hidden Markov models, 1989, IEEE Trans. Acoust. Speech Signal Process.

[13] Hugo Van hamme, et al. Discovering Phone Patterns in Spoken Utterances by Non-Negative Matrix Factorization, 2008, IEEE Signal Processing Letters.

[14] IEEE Staff. 2017 25th European Signal Processing Conference (EUSIPCO), 2017.

[15] Paris Smaragdis, et al. Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs, 2004, ICA.

[16] Pavel Matejka, et al. Hierarchical Structures of Neural Networks for Phoneme Recognition, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[17] Barak A. Pearlmutter, et al. Convolutive Non-Negative Matrix Factorisation with a Sparseness Constraint, 2006, 2006 16th IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing.

[18] Paris Smaragdis, et al. A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics, 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).