Two-stage Attentional Auditory Model Inspired Neural Network and Its Application to Speaker Identification

According to neurophysiological studies, the ear decomposes incoming sound into frequency bands and produces an auditory spectrogram; through experiments on auditory attention and biological hearing, researchers have also characterized the response patterns of neurons in the auditory cortex. In this thesis, we use neural networks to construct a model that simulates human hearing and evaluate it on the task of speaker identification, aiming to bridge neurophysiological knowledge and engineering practice. The proposed model consists of two convolutional neural network (CNN) layers of different dimensionality, modeling the early cochlear stage and the cortical stage respectively. By designing the initial values of the convolution kernels, namely a bank of one-dimensional band-pass filters for the cochlear stage and two-dimensional filters that jointly resolve spectral and temporal information for the cortical stage, the model converges quickly. During training, the model adapts its parameters to the task and to environmental conditions, mapping the input data to the target representation. We also compare several variants of the proposed architecture and find that, with these designed initial values, the model produces good results even when training is insufficient.
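The two-stage pipeline described above can be sketched numerically: a bank of 1-D cochlear band-pass filters turns the waveform into a rectified auditory spectrogram, and 2-D spectro-temporal filters then analyze that spectrogram jointly in time and frequency. The sketch below is a minimal illustration of this idea, not the thesis implementation; the gammatone-style cochlear kernels and Gabor-style cortical kernels are common stand-ins for such filter initializations, and all function names and parameter values are assumptions for the example.

```python
import numpy as np

def gammatone_kernel(fc, fs, length=256):
    # Hypothetical cochlear-stage initializer: 4th-order gammatone-like
    # impulse response centered at fc Hz, sampled at fs Hz.
    t = np.arange(length) / fs
    erb = 24.7 * (4.37 * fc / 1000 + 1)          # equivalent rectangular bandwidth
    g = t**3 * np.exp(-2 * np.pi * 1.019 * erb * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def cochlear_stage(x, fs, centers):
    # Stage 1: 1-D filterbank convolution, then half-wave rectification,
    # yielding a crude "auditory spectrogram" (one row per channel).
    rows = [np.convolve(x, gammatone_kernel(fc, fs), mode="same") for fc in centers]
    return np.maximum(np.stack(rows), 0.0)

def gabor_2d(rate, scale, shape=(9, 9)):
    # Hypothetical cortical-stage initializer: 2-D Gabor kernel tuned to a
    # temporal modulation (rate) and spectral modulation (scale) pair.
    f, t = np.mgrid[-(shape[0] // 2):shape[0] // 2 + 1,
                    -(shape[1] // 2):shape[1] // 2 + 1]
    envelope = np.exp(-(f**2 + t**2) / (2.0 * (shape[0] / 4.0) ** 2))
    return envelope * np.cos(2 * np.pi * (scale * f + rate * t))

def conv2d_same(a, k):
    # Naive same-size 2-D convolution (no external dependencies).
    kh, kw = k.shape
    ap = np.pad(a, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = np.sum(ap[i:i + kh, j:j + kw] * k)
    return out

def cortical_stage(spec, kernels):
    # Stage 2: 2-D convolution of the auditory spectrogram with each
    # spectro-temporal modulation filter.
    return np.stack([conv2d_same(spec, k) for k in kernels])
```

In the thesis model these kernels are only the starting point: both stages are trainable CNN layers, so backpropagation refines the filters for the speaker-identification objective while the designed initial values speed up convergence.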
