Polyglot Speech Synthesis Based on Cross-Lingual Frame Selection Using Auditory and Articulatory Features

In this paper, an approach to polyglot speech synthesis based on cross-lingual frame selection is proposed. The method requires only monolingual speech data from different speakers in different languages to build a polyglot synthesis system, thus reducing the burden of data collection. Specifically, a set of artificial utterances in the second language is constructed for a target speaker through the proposed cross-lingual frame-selection process, and this data set is used to adapt a second-language synthesis model to that speaker. Within the frame-selection process, we propose using auditory and articulatory features to improve the quality of the synthesized polyglot speech. For evaluation, a Mandarin-English polyglot system was implemented in which the target speaker speaks only Mandarin. The results show that the proposed method achieves good performance in terms of both voice identity and speech quality.
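To make the pipeline concrete, the sketch below illustrates the core frame-selection step under simplifying assumptions: each frame of a second-language reference utterance is matched to the target speaker's closest frame under a weighted distance over auditory and articulatory feature vectors, and the selected frames form the artificial utterance used for adaptation. This is a minimal sketch, not the paper's exact formulation; all names (select_frames, w_aud, w_art) are hypothetical, the weighted Euclidean cost is an assumption, and the actual method may also include continuity costs across consecutive frames.

```python
# Minimal, illustrative sketch of cross-lingual frame selection.
# Assumes per-frame auditory and articulatory features are precomputed.
import numpy as np

def select_frames(ref_aud, ref_art, tgt_aud, tgt_art, w_aud=1.0, w_art=1.0):
    """For each frame of a second-language reference utterance, pick the
    target-speaker frame minimizing a weighted distance over auditory
    and articulatory features.

    ref_aud: (T, Da) auditory features of the reference utterance
    ref_art: (T, Dr) articulatory features of the reference utterance
    tgt_aud: (N, Da) auditory features of the target speaker's inventory
    tgt_art: (N, Dr) articulatory features of the same inventory
    Returns a list of T indices into the target frame inventory.
    """
    selected = []
    for t in range(ref_aud.shape[0]):
        d_aud = np.linalg.norm(tgt_aud - ref_aud[t], axis=1)  # auditory distance
        d_art = np.linalg.norm(tgt_art - ref_art[t], axis=1)  # articulatory distance
        selected.append(int(np.argmin(w_aud * d_aud + w_art * d_art)))
    return selected

# Toy usage with random features standing in for real extractions.
rng = np.random.default_rng(0)
idx = select_frames(rng.normal(size=(100, 24)), rng.normal(size=(100, 8)),
                    rng.normal(size=(5000, 24)), rng.normal(size=(5000, 8)))
# The selected target-speaker frames would then be concatenated into an
# artificial second-language utterance used to adapt the synthesis model.
```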
