Singing Voice Conversion Using Posted Waveform Data on Music Social Media

This paper proposes a method of selecting training data for many-to-one singing voice conversion (VC) from waveform data posted on the music social media app “nana.” On this app, users can share sounds recorded on their smartphones, such as speech, singing, and instrumental music. The accumulated waveform data now exceeds one million hours and can be regarded as “big data,” from which advanced deep learning technology can extract great value. nana's database contains many posts in which different users sing the same song, and such posts are well suited as training data for VC, because VC frameworks based on statistical approaches often require parallel data sets consisting of pairs of waveforms in which the source and target singers sing the same phrases. The proposed method composes parallel data sets for many-to-one statistical VC from nana's database by extracting, from the results of dynamic programming (DP) matching, the frames whose utterance timing differs only slightly between the two singers. Experimental results indicate that a system trained on data composed by our method converts acoustic features more accurately than a system that does not use the method.
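
As a rough illustration of the frame-selection step described above, the sketch below aligns two singers' feature sequences with plain DP (DTW) matching and keeps only aligned frame pairs whose time indices differ by at most a few frames. The function names, the Euclidean frame distance, and the |i - j| timing-difference threshold are assumptions for illustration only, not the paper's exact criterion.

```python
import numpy as np
from scipy.spatial.distance import cdist


def dtw_path(src_feats, tgt_feats):
    """Align two feature sequences (T1 x D and T2 x D) with plain
    dynamic-programming (DTW) matching and return the warping path."""
    dist = cdist(src_feats, tgt_feats, metric="euclidean")  # frame-to-frame distances
    T1, T2 = dist.shape
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], T1, T2
    while i > 0 and j > 0:  # backtrack the lowest-cost alignment
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def select_parallel_frames(src_feats, tgt_feats, max_timing_diff=5):
    """Keep only aligned frame pairs whose time indices differ by at most
    max_timing_diff frames (a hypothetical stand-in for the paper's
    'small difference in utterance timing' criterion)."""
    path = dtw_path(src_feats, tgt_feats)
    pairs = [(i, j) for i, j in path if abs(i - j) <= max_timing_diff]
    src = np.stack([src_feats[i] for i, _ in pairs])
    tgt = np.stack([tgt_feats[j] for _, j in pairs])
    return src, tgt  # frame-aligned parallel training data
```

In a many-to-one setting, a routine like select_parallel_frames would be applied between each source singer's post and the target singer's post of the same song, and the surviving frame pairs pooled into one parallel training set.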