Mixture of Gaussian process experts for predicting sung melodic contour with expressive dynamic fluctuations

We present a generative model that predicts the sung melodic contour, i.e., the F0 contour, for a given musical score, including expressive dynamic fluctuations such as vibrato and portamento. Although several studies have attempted to characterize such fluctuations, no systematic method has been developed for generating an F0 contour that contains them and is tied to the musical notes. In our model, the relationship between a musical note sequence and the F0 contour is learned directly by a mixture of Gaussian process experts. This approach lets us characterize the fluctuations automatically through the kernel function of each Gaussian process expert and predict the F0 contour for an arbitrary musical note sequence. Experimental results show that our model predicts the F0 contour more accurately than a baseline method. We also discuss which musical contexts are effective for prediction and how much training data the prediction requires.
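To make the idea of kernel-specific experts concrete, the following is a minimal sketch and not the model evaluated in this work. It assumes a single scalar input (a hypothetical "score position"), two hand-assigned experts with fixed kernels (a periodic kernel standing in for vibrato, a squared-exponential kernel standing in for a smooth glide), and a simple Gaussian gating rule around each expert's training inputs. All function names, hyperparameters, and the synthetic data are illustrative assumptions; the actual model learns the relationship from musical contexts and infers expert assignments rather than fixing them by hand.

```python
import math

def rbf_kernel(a, b, length=0.2):
    """Squared-exponential kernel: assumed stand-in for smooth, glide-like shapes."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def periodic_kernel(a, b, period=0.2, length=1.0):
    """Periodic kernel: assumed stand-in for vibrato-like oscillation."""
    s = math.sin(math.pi * (a - b) / period)
    return math.exp(-2.0 * s * s / length ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

class GPExpert:
    """One Gaussian process regressor with its own kernel (zero prior mean)."""

    def __init__(self, kernel, noise=1e-2):
        self.kernel, self.noise = kernel, noise

    def fit(self, xs, ys):
        # Gram matrix plus observation-noise variance on the diagonal.
        K = [[self.kernel(a, b) + (self.noise if i == j else 0.0)
              for j, b in enumerate(xs)] for i, a in enumerate(xs)]
        self.xs = list(xs)
        self.alpha = chol_solve(cholesky(K), list(ys))
        self.center = sum(xs) / len(xs)  # used only by the toy gate below
        return self

    def predict(self, x):
        # Posterior mean: k(x, X) (K + noise * I)^{-1} y
        return sum(self.kernel(x, xi) * a for xi, a in zip(self.xs, self.alpha))

def gate_weights(x, experts, width=0.3):
    """Toy gate (an assumption): softmax of Gaussian scores around expert centers."""
    logits = [-0.5 * ((x - e.center) / width) ** 2 for e in experts]
    m = max(logits)
    scores = [math.exp(v - m) for v in logits]
    total = sum(scores)
    return [s / total for s in scores]

def mixture_predict(x, experts):
    """Gate-weighted combination of the experts' posterior means."""
    return sum(w * e.predict(x) for w, e in zip(gate_weights(x, experts), experts))

# Synthetic training data (illustrative only, not from any singing corpus).
xs_vib = [0.05 * i for i in range(20)]            # oscillating region
ys_vib = [math.sin(2.0 * math.pi * x / 0.2) for x in xs_vib]
xs_glide = [1.0 + 0.05 * i for i in range(20)]    # smooth transition region
ys_glide = [math.tanh(4.0 * (x - 1.5)) for x in xs_glide]

experts = [GPExpert(periodic_kernel).fit(xs_vib, ys_vib),
           GPExpert(rbf_kernel).fit(xs_glide, ys_glide)]

for x in (0.05, 0.5, 1.5, 1.9):
    print(f"x={x:.2f}  predicted deviation={mixture_predict(x, experts):+.3f}")
```

The point of the sketch is the division of labor: each expert's kernel encodes one kind of fluctuation, and the gate decides which expert dominates at a given input. In this toy version the assignment of data to experts is fixed in advance, which is the main simplification relative to a learned mixture.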
