Listening to features

This work explores nonparametric methods that aim at synthesizing audio from low-dimensional acoustic features of the kind typically used in MIR frameworks. Several issues prevent this task from being straightforward. Such features are designed for analysis rather than synthesis, and thus favor high-level description over easily invertible acoustic representations. Whereas some previous studies have considered the problem of synthesizing audio from features such as Mel-Frequency Cepstral Coefficients, they mainly relied on the explicit formula used to compute those features in order to invert them. Here, we instead adopt a simple blind approach, where arbitrary sets of features can be used during synthesis and where reconstruction is exemplar-based. After testing the approach on the problem of synthesizing speech from well-known features, we apply it to the more complex task of inverting songs from the Million Song Dataset. This task is harder for two reasons. First, the features are irregularly spaced in time, following an onset-based segmentation. Second, the exact method used to compute these features is unknown, although features for new audio can be computed through their API as a black box. In this paper, we detail these difficulties and present a framework that nonetheless attempts such synthesis by concatenating audio samples from a training dataset whose features have been computed beforehand. Samples are selected at the segment level, in the feature space, with a simple nearest-neighbor search. Additional constraints can then be defined to enhance the pertinence of the synthesis. Preliminary experiments are presented using the RWC and GTZAN audio datasets to synthesize tracks from the Million Song Dataset.
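The sketch below illustrates the kind of segment-level, exemplar-based selection the abstract describes; it is not the paper's implementation. It assumes the target track is already split into onset-based segments with one feature vector each, that a training corpus has been analyzed into (feature vector, audio segment) pairs, and that selection is a plain Euclidean nearest-neighbor search with naive concatenation. The function name and signature are illustrative.

```python
# Minimal sketch of exemplar-based synthesis by segment-level nearest-neighbor
# selection and concatenation (illustrative, not the authors' code).
import numpy as np

def synthesize_from_features(target_features, corpus_features, corpus_audio):
    """Concatenate corpus audio segments whose features best match the targets.

    target_features : (T, D) array, one feature vector per target segment
    corpus_features : (N, D) array, one feature vector per corpus segment
    corpus_audio    : list of N 1-D arrays, the corresponding audio segments
    """
    output = []
    for f in target_features:
        # Nearest neighbor in feature space (segment-level selection).
        dists = np.linalg.norm(corpus_features - f, axis=1)
        best = int(np.argmin(dists))
        output.append(corpus_audio[best])
    # Naive concatenation; smoothing the joins (e.g., short crossfades) or
    # penalizing discontinuities would be examples of the "additional
    # constraints" mentioned above.
    return np.concatenate(output) if output else np.zeros(0)
```

In practice the corpus features would have to be computed with the same black-box extractor (e.g., the API mentioned above) as the target features, so that both live in the same feature space.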
