Learning Transposition-Invariant Interval Features from Symbolic Music and Audio

Many music-theoretical constructs (such as scale types, modes, cadences, and chord types) are defined in terms of pitch intervals, i.e., relative distances between pitches. Therefore, when computer models are employed in music tasks, it can be useful to operate on interval representations rather than on the raw musical surface. Moreover, interval representations are transposition-invariant, which is valuable for tasks like audio alignment, cover song detection, and music structure analysis. We employ a gated autoencoder to learn fixed-length, invertible, and transposition-invariant interval representations from polyphonic music, both in the symbolic domain and in audio. We propose an unsupervised training method that yields a musically plausible organization of intervals in the representation space. Based on the learned representations, we construct a transposition-invariant self-similarity matrix and use it to detect repeated sections in symbolic music and in audio, yielding competitive results in the MIREX task "Discovery of Repeated Themes and Sections".
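
To make the idea concrete, below is a minimal sketch in PyTorch of a factored gated autoencoder in the general style of Memisevic's formulation: a mapping code relates a frame x_t to its successor x_{t+1}, and since a transposition shifts both frames equally, the code captures relative (interval) content. The class name, layer sizes, the toy training loop, and the cosine-similarity self-similarity matrix are illustrative assumptions, not the paper's exact architecture or its specific unsupervised training scheme.

# Sketch: factored gated autoencoder (GAE) learning mapping codes between
# consecutive frames, plus a self-similarity matrix built from those codes.
# All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAutoencoder(nn.Module):
    def __init__(self, n_in, n_factors=256, n_maps=128):
        super().__init__()
        self.U = nn.Linear(n_in, n_factors, bias=False)    # factors for x_t
        self.V = nn.Linear(n_in, n_factors, bias=False)    # factors for x_{t+1}
        self.W = nn.Linear(n_factors, n_maps, bias=False)  # mapping layer

    def mappings(self, x, y):
        # Multiplicative interaction of the two factor projections; the
        # mapping code m encodes the transformation (interval content)
        # relating x and y, independent of their absolute pitch level.
        return torch.sigmoid(self.W(self.U(x) * self.V(y)))

    def reconstruct_y(self, x, m):
        # Reconstruct y from x and the mapping code (tied weights).
        f = self.U(x) * F.linear(m, self.W.weight.t())
        return F.linear(f, self.V.weight.t())

    def forward(self, x, y):
        m = self.mappings(x, y)
        return self.reconstruct_y(x, m), m

def transposition_invariant_ssm(model, frames):
    # Self-similarity matrix over mapping codes of consecutive frame pairs;
    # cosine similarity keeps the comparison scale-insensitive.
    with torch.no_grad():
        m = model.mappings(frames[:-1], frames[1:])
        m = F.normalize(m, dim=1)
        return m @ m.t()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = GatedAutoencoder(n_in=120)          # e.g. pitch or spectrogram bins
    frames = torch.rand(500, 120)               # random stand-in for real frames
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(10):                         # toy training loop
        y_hat, _ = model(frames[:-1], frames[1:])
        loss = F.mse_loss(y_hat, frames[1:])
        opt.zero_grad(); loss.backward(); opt.step()
    ssm = transposition_invariant_ssm(model, frames)
    print(ssm.shape)                            # (499, 499)

In this toy setup the reconstruction loss alone drives learning; the paper's actual unsupervised procedure additionally shapes the representation space so that intervals are organized in a musically plausible way, which the sketch does not attempt to reproduce.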
