Apprentissage automatique de représentation de voix à l’aide d’une distillation de la connaissance pour le casting vocal (Learning voice representation using knowledge distillation for automatic voice casting )

The search for professional voice-actors for audiovisual productions is a sensitive task, performed by the artistic directors (ADs). The ADs have a strong appetite for new talents/voices but cannot perform large scale auditions. Automatic tools able to suggest the most suited voices are of a great interest for audiovisual industry. In previous works, we showed the existence of acoustic information allowing to mimic the AD's choices. However, the only available information is the ADs' choices from the already dubbed multimedia productions. In this paper, we propose a representation-learning based strategy to build a character/role representation, called p-vector. In addition, the large variability between audiovisual productions makes difficult to have homogeneous training datasets. We overcome this difficulty by using knowledge distillation methods to take advantage of external datasets. Experiments are conducted on video-game voice excerpts. Results show a significant improvement using the p-vector, compared to the speaker-based x-vectors representation.

[1]  Sanjeev Khudanpur,et al.  Deep neural network-based speaker embeddings for end-to-end speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[2]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[3]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[4]  Ryo Masumura,et al.  Domain adaptation of DNN acoustic models using knowledge distillation , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Jonathan Le Roux,et al.  Student-teacher network learning with enhanced features , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  Axel Röbel,et al.  Similarity Search of Acted Voices for Automatic Voice Casting , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Rauf Izmailov,et al.  Learning using privileged information: similarity control and knowledge transfer , 2015, J. Mach. Learn. Res..

[8]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[10]  Neethu Mariam Joy,et al.  Generalized Distillation Framework for Speaker Normalization , 2017, INTERSPEECH.

[11]  Tomoko Matsui,et al.  Robust Speech Recognition Using Generalized Distillation Framework , 2016, INTERSPEECH.

[12]  Koichi Shinoda,et al.  Wise teachers train better DNN acoustic models , 2016, EURASIP J. Audio Speech Music. Process..

[13]  Bernhard Schölkopf,et al.  Unifying distillation and privileged information , 2015, ICLR.

[14]  Sanjeev Khudanpur,et al.  Deep Neural Network Embeddings for Text-Independent Speaker Verification , 2017, INTERSPEECH.

[15]  Mickael Rouvier,et al.  Acoustic Pairing of Original and Dubbed Voices in the Context of Video Game Localization , 2017, INTERSPEECH.

[16]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[17]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Axel Röbel,et al.  On automatic voice casting for expressive speech: Speaker recognition vs. speech classification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Yifan Gong,et al.  Large-Scale Domain Adaptation via Teacher-Student Learning , 2017, INTERSPEECH.

[20]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Jean-François Bonastre,et al.  Similarity Metric Based on Siamese Neural Networks for Voice Casting , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).