Deep Factorization for Speech Signal

Various informative factors are mixed in speech signals, which makes it difficult to decode any individual factor. An intuitive remedy is to factorize each speech frame into its constituent informative factors, though this has proved highly difficult. Recently, we found that speaker traits, long assumed to be long-term distributional properties, are in fact short-time patterns that can be learned by a carefully designed deep neural network (DNN). This discovery motivates the cascade deep factorization (CDF) framework presented in this paper. The proposed framework infers speech factors sequentially, using previously inferred factors as conditional variables when inferring subsequent factors. We will show that this approach can effectively factorize speech signals, and that from these factors the original speech spectrum can be recovered with high accuracy. This factorization-and-reconstruction approach offers potential value for many speech processing tasks, e.g., speaker recognition and emotion recognition, as demonstrated in the paper.
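To make the cascade concrete, the sketch below illustrates the sequential, conditioned inference the abstract describes: a linguistic factor is inferred first, a speaker factor is then inferred conditioned on it, and an emotion factor conditioned on both, after which the spectrum is reconstructed from the three factors. This is a minimal sketch under stated assumptions, not the paper's implementation: all layer sizes, the choice of factors, and the randomly initialized networks (standing in for the paper's trained factor extractors) are illustrative.

```python
# Minimal NumPy sketch of the cascade deep factorization (CDF) idea.
# Hypothetical dimensions and untrained random weights throughout; the
# real CDF factor extractors would be trained DNNs.
import numpy as np

rng = np.random.default_rng(0)

def dnn(in_dim, out_dim, hidden=64):
    """Build a tiny two-layer feedforward net with random weights."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    return lambda x: np.tanh(x @ W1) @ W2

FEAT_DIM, LING_DIM, SPK_DIM, EMO_DIM = 40, 32, 16, 8

# Stage 1: infer a linguistic (phonetic) factor from the raw frame.
ling_net = dnn(FEAT_DIM, LING_DIM)
# Stage 2: infer the speaker factor, conditioned on the linguistic factor.
spk_net = dnn(FEAT_DIM + LING_DIM, SPK_DIM)
# Stage 3: infer the emotion factor, conditioned on both earlier factors.
emo_net = dnn(FEAT_DIM + LING_DIM + SPK_DIM, EMO_DIM)
# Reconstruction net: recover the spectrum from the three factors alone.
recon_net = dnn(LING_DIM + SPK_DIM + EMO_DIM, FEAT_DIM)

def cdf_factorize(frames):
    """Infer factors sequentially; each stage conditions on earlier ones."""
    f_ling = ling_net(frames)
    f_spk = spk_net(np.concatenate([frames, f_ling], axis=1))
    f_emo = emo_net(np.concatenate([frames, f_ling, f_spk], axis=1))
    return f_ling, f_spk, f_emo

frames = rng.standard_normal((100, FEAT_DIM))   # 100 dummy spectral frames
f_ling, f_spk, f_emo = cdf_factorize(frames)
recon = recon_net(np.concatenate([f_ling, f_spk, f_emo], axis=1))
print(recon.shape)  # (100, 40): spectrum rebuilt from the inferred factors
```

The design point the sketch captures is that each factor extractor sees the earlier factors as conditional inputs, so later stages can explain residual variation rather than re-modeling what earlier stages already captured.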