Generating multiple-accent pronunciations for TTS using joint sequence model interpolation

Standard grapheme-to-phoneme (G2P) systems are trained on a homogeneous lexicon, for example one associated with a particular accent. In practice, a synthesis system may be required to handle multiple accents. Furthermore, a speaker rarely has a pure accent; accents vary continuously within and between regions of a country. Generating a phonetic sequence for each accent is possible, but combining these sequences to yield a single pronunciation for synthesis is highly challenging. To address this problem, this paper considers a space of accents. The bases of this space are statistical G2P models in the form of graphone models, and linear combinations of these models define the accent space. By selecting a point in this continuous space, it is possible to specify the accent of an individual speaker. The approach is evaluated using an accent space defined by American, Scottish and British English. By moving around the accent space, it is shown that speech can be synthesized in each of these accents as well as at a range of intermediate points.

Index Terms: phonetic sequence generation, accent space, interpolation
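The idea of selecting a point in an accent space can be illustrated with a minimal sketch. It assumes that each accent model supplies a probability distribution over graphones for one fixed context, and that interpolation is a convex combination of those distributions. The function name `interpolate`, the graphone labels, and all probabilities below are hypothetical toy values chosen for illustration; they are not taken from the paper, and the paper's actual graphone models are not reproduced here.

```python
from typing import Dict, Sequence

# A toy "model": probability of each graphone in one fixed context.
Distribution = Dict[str, float]


def interpolate(models: Sequence[Distribution],
                weights: Sequence[float]) -> Distribution:
    """Linearly combine per-accent distributions with the given weights.

    The weights are the coordinates of a point in the accent space.
    They are assumed to be non-negative and to sum to one.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    # Union of all graphones seen by any model; a graphone missing
    # from a model contributes probability 0 for that model.
    units = set().union(*models)
    return {
        u: sum(w * m.get(u, 0.0) for m, w in zip(models, weights))
        for u in units
    }


# Hypothetical toy distributions for a single grapheme context.
american = {"a:AE": 0.9, "a:AA": 0.1}
british = {"a:AE": 0.2, "a:AA": 0.8}
scottish = {"a:AE": 0.1, "a:A": 0.9}

# A corner of the space recovers a single accent model.
pure_british = interpolate([american, british, scottish], [0.0, 1.0, 0.0])

# An interior point mixes all three accents.
mixed = interpolate([american, british, scottish], [0.5, 0.3, 0.2])
best = max(mixed, key=mixed.get)  # most probable graphone at this point
```

Because the weights form a convex combination, the result is again a valid probability distribution, so the interpolated model can be used wherever a single-accent model would be, under the assumptions stated above.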
