Automatic Singing Transcription Based on Encoder-decoder Recurrent Neural Networks with a Weakly-supervised Attention Mechanism

This paper describes a neural singing transcription method that estimates a sequence of musical notes directly from the audio signal of a singing voice in an end-to-end manner, without time-aligned training data. The conventional approach to singing transcription is to perform vocal F0 estimation followed by musical note estimation. The performance of this approach, however, is severely limited because F0 estimation errors propagate to the note estimation step and rich acoustic information cannot be exploited. In addition, it is difficult and time-consuming to split continuous singing-voice signals into segments corresponding to musical notes when making precise time-aligned transcriptions. To solve these problems, we use an encoder-decoder model with an attention mechanism that can automatically learn the input-output alignment and mapping, even from non-aligned training data. The main challenge of our study is to estimate temporal categories (note values) in addition to instantaneous categories (pitches). We therefore propose a novel loss function on the attention weights of time-aligned notes for semi-supervised alignment training. By gradually reducing the weight of this loss term during training, a better input-output alignment can be learned much more quickly. Experiments showed that the proposed method performs well on isolated singing voices in popular music.
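
To make the attention-supervision idea concrete, the PyTorch-style sketch below illustrates one plausible formulation: the attention distribution of each time-aligned note is pushed toward its annotated frame by a cross-entropy penalty, and the penalty's weight is annealed as training proceeds. The function and tensor names (`alignment_loss`, `note_to_frame`, the exponential decay schedule) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(attn, note_to_frame, eps=1e-8):
    """Penalty that concentrates each note's attention on its annotated frame.

    attn:          (batch, num_notes, num_frames) attention weights,
                   each row summing to 1 over frames
    note_to_frame: (batch, num_notes) index of the annotated frame
                   for each time-aligned note
    """
    # Gather the attention mass each note places on its annotated frame
    # and penalize its negative log, pushing mass toward that frame.
    picked = attn.gather(2, note_to_frame.unsqueeze(-1)).squeeze(-1)
    return -(picked + eps).log().mean()

def total_loss(logits, targets, attn, note_to_frame, epoch,
               init_weight=1.0, decay=0.9):
    """Note-wise output cross-entropy plus an annealed alignment term.

    logits:  (batch, num_notes, vocab) decoder outputs
    targets: (batch, num_notes) ground-truth note labels
    """
    ce = F.cross_entropy(logits.transpose(1, 2), targets)
    lam = init_weight * (decay ** epoch)  # gradually reduce the weight
    return ce + lam * alignment_loss(attn, note_to_frame)
```

Annealing `lam` toward zero reflects the idea stated in the abstract: the aligned annotations bootstrap a sensible attention pattern early on, after which the model is free to refine the alignment on its own.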
