Deep generative factorization for speech signal

Various information factors are blended in speech signals, which forms the primary difficulty for most speech information processing tasks. An intuitive idea is to factorize speech signal into individual information factors (e.g., phonetic content and speaker trait), though it turns out to be highly challenging. This paper presents a speech factorization approach based on a novel factorial discriminative normalization flow model (factorial DNF). Experiments conducted on a two-factor case that involves phonetic content and speaker trait demonstrates that the proposed factorial DNF has powerful capability to factorize speech signals and outperforms several comparative models in terms of information representation and manipulation.

[1]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  Patrick Kenny,et al.  Maximum likelihood estimation of eigenvoices and residual variances for large vocabulary speech recognition tasks , 2002, INTERSPEECH.

[5]  Yoshua Bengio,et al.  NICE: Non-linear Independent Components Estimation , 2014, ICLR.

[6]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[7]  Sean A. Fulop Speech Spectrum Analysis , 2011 .

[8]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[9]  Dong Wang,et al.  Deep Factorization for Speech Signal , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Lantian Li,et al.  Deep Normalization for Speaker Vectors , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11]  Hiroya Fujisaki,et al.  Prosody, Models, and Spontaneous Speech , 1997, Computing Prosody.

[12]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[13]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[14]  Gunnar Fant,et al.  Acoustic Theory Of Speech Production , 1960 .

[15]  Hao Tang,et al.  An Unsupervised Autoregressive Model for Speech Representation Learning , 2019, INTERSPEECH.

[16]  Yu Zhang,et al.  Learning Latent Representations for Speech Generation and Transformation , 2017, INTERSPEECH.

[17]  Yuxuan Wang,et al.  Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis , 2018, ICML.

[18]  Eric Nalisnick,et al.  Normalizing Flows for Probabilistic Modeling and Inference , 2019, J. Mach. Learn. Res..

[19]  Marwan Al-Akaidi,et al.  Introduction to speech processing , 2004 .

[20]  J. Flanagan Speech Analysis, Synthesis and Perception , 1971 .

[21]  Prafulla Dhariwal,et al.  Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[22]  Samy Bengio,et al.  Density estimation using Real NVP , 2016, ICLR.