Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

Taking long-term spectral and temporal dependencies into account is essential for automatic piano transcription, especially for determining the precise onset and offset of each note in polyphonic piano music. For this purpose, we can rely on the capability of the self-attention mechanism in Transformers to capture long-term dependencies along both the frequency and time axes. In this work, we propose hFT-Transformer, an automatic music transcription method based on a two-level hierarchical frequency-time Transformer architecture. The first hierarchy consists of a convolutional block along the time axis, a Transformer encoder along the frequency axis, and a Transformer decoder that converts the frequency-axis dimension. Its output is then fed into the second hierarchy, which consists of another Transformer encoder along the time axis. We evaluated our method on the widely used MAPS and MAESTRO v3.0.0 datasets, and it achieved state-of-the-art F1 scores on all four metrics: Frame, Note, Note with Offset, and Note with Offset and Velocity.
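To make the data flow of the two-level hierarchy concrete, the sketch below traces a spectrogram through the stages the abstract describes: frequency-axis self-attention, a decoder that maps the frequency dimension to 88 pitches via learned queries, and time-axis self-attention. This is a minimal, hypothetical illustration, not the paper's implementation: the single-head attention with identity projections, the linear lift standing in for the convolutional block, and all dimension names (`d_model`, `n_pitch`) are assumptions chosen only to show the shape transformations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy single-head attention with identity Q/K/V projections.
    # x: (..., seq, d) -> (..., seq, d)
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def cross_attention(q, kv):
    # q: (..., n_q, d), kv: (..., n_kv, d) -> (..., n_q, d)
    scores = q @ np.swapaxes(kv, -1, -2) / np.sqrt(kv.shape[-1])
    return softmax(scores) @ kv

def hft_sketch(spec, n_pitch=88, d_model=16, rng=None):
    """Shape-level sketch of the hierarchical frequency-time flow.
    spec: (T, F) spectrogram; returns (T, n_pitch) framewise activations."""
    if rng is None:
        rng = np.random.default_rng(0)
    T, F = spec.shape
    # Stand-in for the convolutional block: lift each bin to d_model features.
    w = rng.standard_normal((1, d_model))
    x = spec[..., None] @ w                        # (T, F, d_model)
    # 1st hierarchy: Transformer encoder along the frequency axis (per frame).
    x = self_attention(x)                          # (T, F, d_model)
    # Decoder with learned pitch queries converts the frequency axis: F -> n_pitch.
    queries = rng.standard_normal((n_pitch, d_model))
    q = np.broadcast_to(queries, (T, n_pitch, d_model))
    x = cross_attention(q, x)                      # (T, n_pitch, d_model)
    # 2nd hierarchy: Transformer encoder along the time axis (per pitch).
    x = np.swapaxes(x, 0, 1)                       # (n_pitch, T, d_model)
    x = self_attention(x)
    x = np.swapaxes(x, 0, 1)                       # (T, n_pitch, d_model)
    # Stand-in output head: collapse features to a per-(frame, pitch) score.
    return x.mean(-1)                              # (T, n_pitch)

activations = hft_sketch(np.random.default_rng(1).standard_normal((32, 256)))
```

The key step is the decoder's cross-attention: its query length sets the output sequence length, which is how the frequency dimension (spectrogram bins) is converted to the 88-pitch axis before the second hierarchy attends along time.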
