High-dimensional sequence transduction

We investigate the problem of transforming an input sequence into a high-dimensional output sequence in order to transcribe polyphonic audio music into symbolic notation. We introduce a probabilistic model based on a recurrent neural network that is able to learn realistic output distributions given the input and we devise an efficient algorithm to search for the global mode of that distribution. The resulting method produces musically plausible transcriptions even under high levels of noise and drastically outperforms previous state-of- the-art approaches on five datasets of synthesized sounds and real recordings, approximately halving the test error rate.

[1]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[2]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[3]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[4]  Samy Bengio,et al.  Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks , 1999, NIPS.

[5]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[6]  Guy J. Brown,et al.  Techniques for handling convolutional distortion with 'missing data' automatic speech recognition , 2004, Speech Commun..

[7]  Matija Marolt,et al.  A connectionist approach to automatic transcription of polyphonic piano music , 2004, IEEE Transactions on Multimedia.

[8]  Eric A. Hansen,et al.  Beam-Stack Search: Integrating Backtracking with Beam Search , 2005, ICAPS.

[9]  M.P. Ryynanen,et al.  Polyphonic music transcription using note event modeling , 2005, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2005..

[10]  David Barber,et al.  A generative model for music transcription , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Daniel P. W. Ellis,et al.  A Discriminative Model for Polyphonic Piano Transcription , 2007, EURASIP J. Adv. Signal Process..

[12]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[13]  Geoffrey E. Hinton,et al.  The Recurrent Temporal Restricted Boltzmann Machine , 2008, NIPS.

[14]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[15]  Rainer Lienhart,et al.  Note onset detection for the transcription of polyphonic piano music , 2009, 2009 IEEE International Conference on Multimedia and Expo.

[16]  Mert Bay,et al.  Evaluation of Multiple-F0 Estimation and Tracking Systems , 2009, ISMIR.

[17]  Wheeler Ruml,et al.  The Joy of Forgetting: Faster Anytime Search via Restarting , 2010, ICAPS.

[18]  Ilya Sutskever,et al.  Learning Recurrent Neural Networks with Hessian-Free Optimization , 2011, ICML.

[19]  Hugo Larochelle,et al.  The Neural Autoregressive Distribution Estimator , 2011, AISTATS.

[20]  Juhan Nam,et al.  A Classification-Based Polyphonic Piano Transcription Approach Using Learned Feature Representations , 2011, ISMIR.

[21]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[22]  Yoshua Bengio,et al.  Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription , 2012, ICML.

[23]  Markus Schedl,et al.  Polyphonic piano note transcription with recurrent neural networks , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.