Learning to Say It Well: Reranking Realizations by Predicted Synthesis Quality

This paper presents a method for adapting a language generator to the strengths and weaknesses of a synthetic voice, thereby improving the naturalness of synthetic speech in a spoken language dialogue system. The method trains a discriminative reranker to select paraphrases that are predicted to sound natural when synthesized. The reranker is trained on realizer and synthesizer features in a supervised fashion, using human judgements of synthetic voice quality on a sample of paraphrases representative of the generator's capability. Results from a cross-validation study indicate that discriminative paraphrase reranking can achieve substantial improvements in naturalness on average, ameliorating the problem of highly variable synthesis quality typically encountered with today's unit selection synthesizers.
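
The reranking idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a simple pairwise perceptron as a stand-in for the discriminative learner, two hypothetical features per paraphrase (a realizer log-probability and a synthesizer join cost), and invented ratings in place of real human judgements. Paraphrases of the same content are compared in pairs, and the weights are nudged whenever the lower-rated paraphrase scores at least as high as the higher-rated one.

```python
from itertools import combinations


def score(weights, features):
    """Linear score: higher means predicted to sound more natural."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(groups, epochs=50, lr=0.1):
    """Fit weights with a pairwise perceptron (an illustrative stand-in learner).

    groups: list of paraphrase sets; each set is a list of (features, rating)
    pairs for alternative realizations of the same content.
    """
    weights = [0.0] * len(groups[0][0][0])
    for _ in range(epochs):
        for group in groups:
            for (xa, ya), (xb, yb) in combinations(group, 2):
                if ya == yb:
                    continue  # ties carry no preference information
                better, worse = (xa, xb) if ya > yb else (xb, xa)
                diff = [p - q for p, q in zip(better, worse)]
                # Update only when the preferred paraphrase is not ranked higher.
                if score(weights, diff) <= 0:
                    weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def rerank(weights, candidates):
    """Sort (text, features) candidates from best to worst predicted quality."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)


# Hypothetical features: [realizer_logprob, synth_join_cost]; ratings are made up.
training_groups = [
    [([-2.0, 0.9], 2.0), ([-2.5, 0.2], 4.5), ([-3.0, 0.5], 3.0)],
    [([-1.5, 0.8], 2.5), ([-1.8, 0.1], 5.0)],
]
weights = train_pairwise_ranker(training_groups)

new_candidates = [
    ("paraphrase A", [-2.0, 0.7]),
    ("paraphrase B", [-2.2, 0.15]),
]
best_text, _ = rerank(weights, new_candidates)[0]
print(best_text)
```

At generation time, only the top-ranked paraphrase would be sent to the synthesizer. In this toy data the ratings fall as the join cost rises, so the learned weights penalize that feature; with real judgements the feature set and the learner would both need to be chosen and validated empirically.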
