Synthesising Expressiveness in Peking Opera via Duration Informed Attention Network

This paper presents a method for synthesising expressive Peking opera singing voices. Expressive opera singing synthesis usually requires pitch contours to be extracted as training data; this step depends on automatic extraction techniques and cannot be corrected by manual labelling. Using the Duration Informed Attention Network (DurIAN), the proposed method conditions the synthesis on musical notes instead of pitch contours. This allows human annotation to be combined with automatically extracted features as training data, giving extra flexibility in data collection for Peking opera singing synthesis. Compared with the expressive Peking opera singing voice synthesised by a pitch-contour-based system, the proposed musical-note-based system produces singing voices with comparable expressiveness in various aspects.
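As a minimal illustrative sketch (not the authors' implementation), the following Python snippet shows how a musical-note sequence with annotated durations could be expanded into frame-level conditioning features, in the spirit of a duration-informed model that uses score notes rather than extracted pitch contours; all names here (Note, expand_notes, hop_s) are hypothetical.

    # Sketch: duration-informed expansion of a note sequence into frame-level
    # conditioning features (assumed design, not the paper's exact features).
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class Note:
        midi_pitch: int      # symbolic pitch from the score (0 = rest)
        duration_s: float    # annotated note duration in seconds

    def expand_notes(notes: List[Note], hop_s: float = 0.0125) -> np.ndarray:
        """Repeat each note's features over the frames its duration covers,
        producing a frame-aligned conditioning matrix (frames x 2)."""
        frames = []
        for note in notes:
            n_frames = max(1, round(note.duration_s / hop_s))
            for i in range(n_frames):
                # relative position within the note lets the acoustic model
                # learn expressive intra-note pitch movement (ornaments, vibrato)
                rel_pos = i / n_frames
                frames.append([note.midi_pitch, rel_pos])
        return np.asarray(frames, dtype=np.float32)

    if __name__ == "__main__":
        score = [Note(62, 0.40), Note(64, 0.25), Note(0, 0.10), Note(67, 0.60)]
        cond = expand_notes(score)
        print(cond.shape)  # (frames, 2): fed to the acoustic model as conditioning

The within-note relative position is one simple way to give the model room to generate expressive pitch variation inside a note without supplying an explicit F0 target.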
