Paper information - Two-stage Attentional Auditory Model Inspired Neural Network and Its Application to Speaker Identification (基於聽覺感知模型之類神經網路及其在語者識別上之應用) [In Chinese]

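According to neurophysiological research, the ear decomposes sound into its frequency components and produces an auditory spectrum. Drawing on the phenomenon of auditory attention and on biological hearing experiments, researchers have also identified the patterns of neural activity in the auditory cortex. In this paper, we use neural networks to build a model that simulates human hearing and discuss it in the context of speaker identification, with the aim of connecting neurophysiological knowledge with engineering techniques. The model uses two convolutional neural network (CNN) layers of different dimensionality to simulate the early cochlear stage and the cortical stage, respectively. The convolution kernels are given designed initial values: multiple sets of one-dimensional frequency-decomposition filters for the cochlear stage, and two-dimensional filters that analyze temporal and spectral information jointly for the cortical stage, so that the model converges quickly. During training, the model automatically adjusts these parameters according to the objective and the environmental conditions, so that the input data are mapped to the target form. We also compared several variants of the proposed architecture and found that, when the initial values are given, the model produces good results even if training is insufficient.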
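The kernel-initialization idea described above can be illustrated with a minimal sketch. The abstract does not specify the filter shapes, so everything here is an assumption made for illustration: the Hann-windowed cosine band-pass filters, the Gabor-like two-dimensional kernels, the function names `cochlear_kernels` and `cortical_kernel`, and all default parameter values are hypothetical and are not taken from the paper.

```python
import math


def cochlear_kernels(num_filters=8, kernel_len=65, sample_rate=16000.0,
                     f_min=100.0, f_max=4000.0):
    """Return 1-D band-pass kernels (assumed shape: Hann-windowed cosines).

    Centre frequencies are log-spaced between f_min and f_max, loosely
    imitating the frequency decomposition of the cochlea. Hypothetical
    stand-in for the paper's cochlear-stage initial values.
    """
    kernels = []
    centre = (kernel_len - 1) / 2
    for i in range(num_filters):
        frac = i / (num_filters - 1) if num_filters > 1 else 0.0
        fc = f_min * (f_max / f_min) ** frac  # log-spaced centre frequency
        taps = []
        for n in range(kernel_len):
            t = (n - centre) / sample_rate  # time offset in seconds
            window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (kernel_len - 1))
            taps.append(window * math.cos(2 * math.pi * fc * t))
        # Normalise each kernel to unit energy.
        norm = math.sqrt(sum(x * x for x in taps))
        kernels.append([x / norm for x in taps])
    return kernels


def cortical_kernel(rate_hz, scale_cyc_per_chan, n_freq=9, n_time=9,
                    frame_rate=100.0):
    """Return one 2-D kernel (assumed shape: Gabor-like) as rows of frequency.

    The kernel is tuned to a temporal modulation rate (Hz) and a spectral
    modulation scale (cycles per frequency channel), so it responds to
    time and frequency structure jointly.
    """
    f_centre = (n_freq - 1) / 2
    t_centre = (n_time - 1) / 2
    kernel = []
    for f in range(n_freq):
        df = f - f_centre  # offset in frequency channels
        row = []
        for t in range(n_time):
            dt_frames = t - t_centre  # offset in frames
            envelope = math.exp(-0.5 * ((df / (n_freq / 4)) ** 2
                                        + (dt_frames / (n_time / 4)) ** 2))
            phase = rate_hz * dt_frames / frame_rate + scale_cyc_per_chan * df
            row.append(envelope * math.cos(2 * math.pi * phase))
        kernel.append(row)
    return kernel
```

In a deep learning framework, arrays like these would be copied into the weights of the first (one-dimensional) and second (two-dimensional) convolutional layers before training begins, after which gradient descent is free to adjust them for the task.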
