Towards an Unsupervised Speaking Style Voice Building Framework: Multi-Style Speaker Diarization

Current text-to-speech systems are built from studio-recorded speech, either in a neutral style or with acted emotions. However, the proliferation of media-sharing sites makes it possible to develop a new generation of speech-based systems that can cope with spontaneous and styled speech. This paper proposes an architecture for handling such realistic recordings and reports experiments on unsupervised speaker diarization. To maximize the speaker purity of the clusters while keeping speaker coverage high, the paper evaluates the F-measure of a diarization module, which achieves high scores (>85%), especially when the clusters are longer than 30 seconds, even for the more spontaneous and expressive styles (such as talk shows or sports).

Index Terms: expressive speech synthesis, speaker diarization, speaking styles, voice cloning.
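The F-measure mentioned in the abstract balances speaker purity against speaker coverage. The abstract does not give the exact formulas, so the following is only a minimal sketch. It assumes duration-weighted definitions: purity is the share of total speech that belongs to each cluster's dominant speaker, and coverage is the share that falls in each speaker's dominant cluster. The function name `purity_coverage_f` and the `(cluster_id, speaker_id, duration)` input format are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def purity_coverage_f(segments):
    """Return (purity, coverage, F-measure) for a diarization output.

    segments: iterable of (cluster_id, speaker_id, duration_seconds),
    where speaker_id is the reference label. These definitions are an
    assumption, not necessarily the ones used in the paper.
    """
    # Accumulate the speech duration shared by each (cluster, speaker) pair.
    overlap = defaultdict(float)
    for cluster, speaker, duration in segments:
        overlap[(cluster, speaker)] += duration

    total = sum(overlap.values())
    if total <= 0:
        raise ValueError("no speech duration to evaluate")

    # Largest overlap per cluster (for purity) and per speaker (for coverage).
    best_in_cluster = defaultdict(float)
    best_for_speaker = defaultdict(float)
    for (cluster, speaker), dur in overlap.items():
        best_in_cluster[cluster] = max(best_in_cluster[cluster], dur)
        best_for_speaker[speaker] = max(best_for_speaker[speaker], dur)

    purity = sum(best_in_cluster.values()) / total
    coverage = sum(best_for_speaker.values()) / total
    # Harmonic mean of purity and coverage.
    f_measure = 2 * purity * coverage / (purity + coverage)
    return purity, coverage, f_measure


# Speaker A is split across two clusters: every cluster is pure,
# but A's dominant cluster covers only part of A's speech.
example = [("c1", "A", 50.0), ("c2", "A", 30.0), ("c3", "B", 20.0)]
purity, coverage, f_measure = purity_coverage_f(example)
```

In this toy input, splitting one speaker across clusters keeps purity at its maximum while lowering coverage, which is the trade-off the F-measure is meant to capture.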
