While frame-level audio features, e.g. MFCCs, in combination with the bag-of-frames (BOF) approach have been used widely and successfully, we use a block processing framework in our submission. In general, block-level features have the advantage that they can capture more temporal information than BOF approaches. We introduce two novel spectral patterns, closely related to the spectrum histogram, and propose a modified version of the well-known fluctuation patterns. Based on these patterns we train a support vector machine to classify songs into different categories.

1. AUDIO PREPROCESSING

We use the Java-based audio signal analysis toolbox CoMIRVA (Collection of Music Information Retrieval and Visualization Applications) [1]. This library takes care of decoding and resampling any input audio file to 22 kHz raw PCM. A maximum of four minutes, starting from the beginning of an audio file, is decoded, and the central two minutes of the decoded audio signal are analyzed per audio file. To analyze the audio signal, it is transformed to the frequency domain by applying a Short Time Fourier Transform (STFT) using a window size of 2048 samples, a hop size of 512 samples and a Hanning window. Finally, we compute the magnitude spectrum thereof.

1.1 Cent-Scale

We account for the musical nature of the audio signals by mapping the magnitude spectrum with linear frequency resolution onto a logarithmic musical scale, the Cent-scale [7]. We do so by simply summing all frequency bins of the linear-frequency magnitude spectrum within a constant bandwidth of 100 cent, starting from 2050 cent (equal to about 53.43 Hz). The resulting spectral feature vectors have 97 dimensions. This results in a linear frequency resolution up to about 430 Hz and compresses the higher frequency content logarithmically thereafter (see Figure 1). We then transform the compressed magnitude spectrum to a logarithmic amplitude scale (in dB).
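The band summation described above can be sketched for a single STFT frame as follows. This is a minimal illustration, not the CoMIRVA implementation. Several details are assumptions: the sample rate is taken as 22050 Hz, the Cent reference frequency (about 16.35 Hz) is chosen so that 2050 cent maps to roughly 53.43 Hz as stated in the text, a bin is assigned to a band by its centre frequency, and bands that receive no bin are left at zero. The function names are hypothetical.

```python
import math

# Assumption: reference frequency of the Cent scale (about 16.35 Hz), chosen
# so that 2050 cent corresponds to about 53.43 Hz as stated in the text.
REF_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def hz_to_cent(freq_hz):
    """Convert a frequency in Hz to cents above REF_HZ."""
    return 1200.0 * math.log2(freq_hz / REF_HZ)


def cent_to_hz(cent):
    """Convert cents above REF_HZ back to a frequency in Hz."""
    return REF_HZ * 2.0 ** (cent / 1200.0)


def cent_scale(magnitudes, sample_rate=22050, n_fft=2048,
               start_cent=2050.0, band_cent=100.0, n_bands=97):
    """Sum linear-frequency magnitude bins into bands of constant cent width.

    `magnitudes` holds the n_fft // 2 + 1 magnitude values of one STFT frame.
    Bins below `start_cent` or above the last band are discarded.
    """
    bands = [0.0] * n_bands
    # Skip bin 0 (0 Hz), for which the logarithm is undefined.
    for k in range(1, len(magnitudes)):
        freq_hz = k * sample_rate / n_fft  # centre frequency of bin k
        index = math.floor((hz_to_cent(freq_hz) - start_cent) / band_cent)
        if 0 <= index < n_bands:
            bands[index] += magnitudes[k]
    return bands
```

For example, `cent_scale([1.0] * 1025)` maps a flat spectrum of 1025 bins to 97 band values. With these assumed parameters the lowest bands receive at most one bin each, while higher bands collect progressively more bins.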
Altogether, the mapping onto the Cent-scale is a fast approximation of a constant-Q transform, but with a constant window length for all frequency bins.

Figure 1: Spectrogram with linear frequency resolution (upper illustration) and the cent-scaled equivalent (lower illustration).

1.2 Audio Normalization

Audio files are recorded at different volume levels. From a technical point of view, this means that the whole audio signal is amplified by a constant factor. The magnitude spectrum of the amplified signal is scaled by the same constant factor, as the Fourier transform is a linear transformation. As we process all audio blocks on a logarithmic amplitude scale (in dB), the amplified magnitude spectrum (in dB) is offset by a constant. For some features it can be advantageous to be loudness invariant. Thus, we perform an audio normalization. In some audio applications this is achieved by a simple frame-by-frame mean removal. Removing the mean of each frame makes the spectral representation invariant to the constant offset. However, the local loudness information is lost, as all frames will have zero mean. The only information left is the spectral envelope of the audio frame. To keep some local loudness information but still make the whole audio signal loudness invariant, the constant offset of a frame is estimated not from a single local frame, but from a fixed-size neighborhood (in our experiments we use ±100 frames) around each frame. From each frame we remove the mean of its neighborhood.
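The neighborhood-based mean removal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the input is a list of frames already on a dB scale, the neighborhood mean is taken over all band values of all frames in the window, and the window is simply truncated at the start and end of the signal (the text does not say how edges are handled). The function name `normalize_loudness` is hypothetical.

```python
def normalize_loudness(frames_db, radius=100):
    """Subtract from each frame the mean of its +/- `radius` frame neighborhood.

    `frames_db` is a list of frames, each a list of band values in dB.
    Assumption: near the signal boundaries the neighborhood is truncated.
    """
    if not frames_db:
        return []
    n_frames = len(frames_db)
    n_bands = len(frames_db[0])

    # Prefix sums over per-frame totals make each window sum O(1).
    prefix = [0.0]
    for frame in frames_db:
        prefix.append(prefix[-1] + sum(frame))

    normalized = []
    for t, frame in enumerate(frames_db):
        lo = max(0, t - radius)
        hi = min(n_frames, t + radius + 1)  # exclusive upper bound
        mean = (prefix[hi] - prefix[lo]) / ((hi - lo) * n_bands)
        normalized.append([value - mean for value in frame])
    return normalized
```

Adding a constant dB offset to every frame (a louder recording) shifts each neighborhood mean by the same constant, so the output is unchanged. With `radius=0` the sketch reduces to the plain frame-by-frame mean removal, where every frame ends up with zero mean.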
[1] Peter Knees et al. The CoMIRVA Toolkit for Visualizing Music-Related Data. EuroVis, 2007.
[2] Ian H. Witten et al. Data mining: practical machine learning tools and techniques, 3rd Edition. 1999.
[3] Elias Pampalk et al. Content-based organization and visualization of music archives. MULTIMEDIA '02, 2002.
[4] Masataka Goto et al. SmartMusicKIOSK: music listening station with chorus-search function. UIST '03, 2003.
[5] Constantine Kotropoulos et al. Music Genre Classification: A Multilinear Approach. ISMIR, 2008.
[6] George Tzanetakis et al. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process., 2002.
[7] Klaus Seyerlehner et al. Frame Level Audio Similarity - A Codebook Approach. 2008.