Augmented speech production based on real-time statistical voice conversion

In human-to-human speech communication, various barriers are caused by some constraints, such as physical constraints causing vocal disorders and environmental constraints making it hard to produce intelligible speech. These barriers would be overcome if our speech production was augmented so that we could produce speech sounds as we want beyond these constraints. Voice conversion (VC) is a technique for modifying speech acoustics, converting non-/para-linguistic information to any form we want while preserving the linguistic content. One of the most popular approaches to VC is based on statistical processing, which is capable of extracting a complex conversion function in a data-driven manner. Although this technique was originally studied in the context of speaker conversion, which converts the voice of a certain speaker to sound like that of another specific speaker, it has great potential to achieve various applications beyond speaker conversion. This paper briefly reviews a trajectory-based conversion method that is capable of effectively reproducing natural speech parameter trajectories utterance by utterance and highlights several techniques that extend this trajectory-based conversion method to achieve real-time conversion processing. Finally this paper shows some examples of real-time VC applications to enhance human-to-human speech communication, such as speaking-aid, silent speech communication, and voice changer/vocal effector.

[1]  Alexander Kain,et al.  Spectral voice conversion for text-to-speech synthesis , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[2]  Heiga Zen,et al.  Product of Experts for Statistical Parametric Speech Synthesis , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[3]  Satoshi Nakamura,et al.  Voice conversion through vector quantization , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[4]  Eric Moulines,et al.  Continuous probabilistic transform for voice conversion , 1998, IEEE Trans. Speech Audio Process..

[5]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[6]  Heiga Zen,et al.  Speech Synthesis Based on Hidden Markov Models , 2013, Proceedings of the IEEE.

[7]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[8]  Hirokazu Kameoka,et al.  Generative modeling of speech F0 contours , 2013, INTERSPEECH.

[9]  Tomoki Toda,et al.  A postfilter to modify the modulation spectrum in HMM-based speech synthesis , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Masataka Goto,et al.  VOCALISTENER: A SINGING-TO-SINGING SYNTHESIS SYSTEM BASED ON ITERATIVE PARAMETER ESTIMATION , 2009 .

[11]  Tomoki Toda,et al.  Adaptive voice-quality control based on one-to-many eigenvoice conversion , 2010, INTERSPEECH.

[12]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[13]  Mikihiro Nakagiri,et al.  Statistical Voice Conversion Techniques for Body-Conducted Unvoiced Speech Enhancement , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[14]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[15]  J. M. Gilbert,et al.  Silent speech interfaces , 2010, Speech Commun..

[16]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[17]  Kou Tanaka,et al.  A Hybrid Approach to Electrolaryngeal Speech Enhancement Based on Noise Reduction and Statistical Excitation Generation , 2014, IEICE Trans. Inf. Syst..

[18]  Steve J. Young,et al.  Data-driven emotion conversion in spoken English , 2009, Speech Commun..

[19]  Takashi Nose,et al.  A Style Control Technique for HMM-Based Expressive Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[20]  Tomoki Toda,et al.  Alaryngeal Speech Enhancement Based on One-to-Many Eigenvoice Conversion , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[21]  Tomoki Toda,et al.  Voice Timbre Control Based on Perceived Age in Singing Voice Conversion , 2014, IEICE Trans. Inf. Syst..

[22]  K. Shikano,et al.  Statistical analysis of bilingual speaker's speech for cross-language voice conversion. , 1991, The Journal of the Acoustical Society of America.

[23]  Heiga Zen,et al.  Gaussian Process Experts for Voice Conversion , 2011, INTERSPEECH.

[24]  Takashi Nose,et al.  HMM-Based Voice Conversion Using Quantized F0 Context , 2010, IEICE Trans. Inf. Syst..

[25]  Tetsuya Takiguchi,et al.  Exemplar-Based Voice Conversion Using Sparse Representation in Noisy Environments , 2013, IEICE Trans. Fundam. Electron. Commun. Comput. Sci..

[26]  Tetsuya Takiguchi,et al.  Voice Conversion Based on Speaker-Dependent Restricted Boltzmann Machines , 2014, IEICE Trans. Inf. Syst..

[27]  Kiyohiro Shikano,et al.  Non-Audible Murmur (NAM) Recognition , 2006, IEICE Trans. Inf. Syst..

[28]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[29]  Tomoki Toda,et al.  One-to-Many and Many-to-One Voice Conversion Based on Eigenvoices , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[30]  Ning Xu,et al.  Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data , 2014, Speech Commun..

[31]  Chung-Hsien Wu,et al.  Hierarchical Prosody Conversion Using Regression-Based Clustering for Emotional Speech Synthesis , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[32]  Tomoki Toda,et al.  Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[33]  Li-Rong Dai,et al.  Voice Conversion Using Deep Neural Networks With Layer-Wise Generative Training , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[34]  Aijun Li,et al.  Prosody conversion from neutral speech to emotional speech , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[35]  Keiichi Tokuda,et al.  Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model , 2008, Speech Commun..

[36]  Mark J. F. Gales,et al.  Semi-tied covariance matrices for hidden Markov models , 1999, IEEE Trans. Speech Audio Process..

[37]  Takao Kobayashi,et al.  Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training , 2007, IEICE Trans. Inf. Syst..

[38]  Bayya Yegnanarayana,et al.  Transformation of formants for voice conversion using artificial neural networks , 1995, Speech Commun..

[39]  Haizhou Li,et al.  Exemplar-Based Sparse Representation With Residual Compensation for Voice Conversion , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[40]  Tomoki Toda,et al.  A digital signal processor implementation of silent/electrolaryngeal speech enhancement based on real-time statistical voice conversion , 2013, INTERSPEECH.

[41]  Tomoki Toda,et al.  Singing voice conversion method based on many-to-many eigenvoice conversion and training data generation using a singing-to-singing synthesis system , 2012, Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference.

[42]  Chung-Hsien Wu,et al.  Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[43]  Tomoki Toda,et al.  Implementation of Computationally Efficient Real-Time Voice Conversion , 2012, INTERSPEECH.

[44]  Daniel Erro,et al.  Voice Conversion Based on Weighted Frequency Warping , 2010, IEEE Transactions on Audio, Speech, and Language Processing.