Learning Complex Basis Functions for Invariant Representations of Audio

Learning features from data has been shown to be more successful than using hand-crafted features for many machine learning tasks. In music information retrieval (MIR), features learned from windowed spectrograms vary strongly under transformations such as transposition or time shift. Such variance is undesirable when it is irrelevant to the respective MIR task. We propose an architecture called the Complex Autoencoder (CAE), which learns features invariant to orthogonal transformations. Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant "magnitude space" and a transformation-variant "phase space". The phase space is useful for inferring transformations between data pairs. By exploiting the invariance property of the magnitude space, we achieve state-of-the-art results in audio-to-score alignment and repeated section discovery for audio. A PyTorch implementation of the CAE, including the repeated section discovery method, is available online.
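
The split into a magnitude space and a phase space can be illustrated with a minimal sketch. It does not use the CAE or its learned basis: as an assumption, it substitutes the fixed discrete Fourier basis for the learned complex basis functions and uses a circular time shift (a permutation, hence an orthogonal transformation) as the transformation. The helper names `dft_basis`, `project`, and `circular_shift` are hypothetical and chosen only for this illustration.

```python
import cmath
import math


def dft_basis(n):
    """Fixed complex basis (unitary DFT); a stand-in for learned basis functions."""
    return [
        [cmath.exp(-2j * math.pi * k * t / n) / math.sqrt(n) for t in range(n)]
        for k in range(n)
    ]


def project(signal, basis):
    """Map a real signal onto each complex basis function."""
    return [sum(b * x for b, x in zip(row, signal)) for row in basis]


def circular_shift(signal, s):
    """Shift the signal right by s samples, wrapping around (orthogonal map)."""
    s %= len(signal)
    return signal[-s:] + signal[:-s]


signal = [0.0, 1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.0]
n = len(signal)
shift = 3

basis = dft_basis(n)
z_orig = project(signal, basis)
z_shift = project(circular_shift(signal, shift), basis)

# Magnitude space: unchanged by the shift, up to floating-point error.
max_mag_diff = max(abs(abs(a) - abs(b)) for a, b in zip(z_orig, z_shift))

# Phase space: component k=1 rotates by -2*pi*shift/n, which reveals the shift.
phase_diff = cmath.phase(z_shift[1] / z_orig[1])
recovered_shift = round(-phase_diff * n / (2 * math.pi)) % n

print("max magnitude difference:", max_mag_diff)
print("shift recovered from phase:", recovered_shift)
```

In this toy setting the magnitudes serve as a shift-invariant feature, while the phase difference between the two projections identifies which shift relates the pair. The CAE learns its basis from data instead of fixing it in advance.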
