Deep Autotuner: A Data-Driven Approach to Natural-Sounding Pitch Correction for Singing Voice in Karaoke Performances

We describe a machine-learning approach to pitch-correcting a solo singing performance in a karaoke setting, where the solo voice and accompaniment are on separate tracks. The proposed approach addresses the situation where no musical score exists for either the vocals or the accompaniment: it predicts the amount of correction from the relationship between the spectral content of the vocal and accompaniment tracks. The pitch shift in cents suggested by the model can then be used to make the voice sound in tune with the accompaniment. This approach differs from commercial automatic pitch correction systems, in which notes in the vocal track are shifted to be centered on notes in a user-defined score or mapped to the closest pitch among the twelve equal-tempered scale degrees. We train the model on a dataset of 4,702 amateur karaoke performances selected for good intonation, and we present a Convolutional Gated Recurrent Unit (CGRU) model to accomplish this task. The method can be extended to unsupervised pitch correction of a vocal performance, popularly referred to as autotuning.
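The correction the model predicts is expressed in cents, a logarithmic unit in which 100 cents equal one equal-tempered semitone. As a minimal sketch (not the paper's implementation; the function names are ours), the mapping between a frequency pair and a cents deviation, and the application of a predicted shift, can be written as:

```python
import math

def cents_between(f_ref: float, f_sung: float) -> float:
    """Deviation of f_sung from f_ref in cents (positive = sharp)."""
    return 1200.0 * math.log2(f_sung / f_ref)

def apply_cents_shift(f: float, cents: float) -> float:
    """Shift a frequency f by the given number of cents."""
    return f * 2.0 ** (cents / 1200.0)
```

For example, if the model predicts a correction of -25 cents for a frame sung at 440 Hz, `apply_cents_shift(440.0, -25.0)` gives the slightly flattened target frequency; a score-free system outputs such continuous shifts rather than snapping to the nearest semitone.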
