Post Processing Music Similarity Computations

Today, among the best-performing algorithms for music similarity computation are those based on Mel Frequency Cepstrum Coefficients (MFCCs). In these algorithms, each music track is modelled as a Gaussian Mixture Model (GMM) of its MFCCs, and the similarity between two tracks is computed by comparing their GMMs. As pointed out in [1, 2, 3], the distance space obtained this way has some undesirable properties. In this MIREX'06 submission, a technique has been implemented that aims to correct such anomalies to a certain extent (for more detailed evaluations, please refer to [4]). The described algorithm ranked second (out of six) in the MIREX evaluation based on human listeners (note that the differences between the top five ranked algorithms are not statistically significant). There is an indication that it works better for artist identification than the other submitted algorithms.

1. Feature Extraction and Basic Distance Computation

The basic feature extraction process is quite similar to the one in [5]. It was chosen because of its good tradeoff between runtime and quality, and because algorithms based on related techniques yielded good results in MIREX'05.

• The input wave files (22,050 Hz sampling rate, mono) are divided into frames of 512 samples length, with 256 samples overlap, disregarding the first and last 30 seconds.
• The number of frames corresponding to 2 minutes (i.e. 20,672 frames) is used for feature extraction. In the submitted algorithm, these frames are not chosen to be consecutive. Instead, the length of the wave data is divided into 20,672 fragments of equal length, and from each of these fragments, 512 consecutive samples are chosen at a random position for feature extraction (see the first sketch at the end of this section). Choosing the frames randomly in this way reduces possible aliasing effects with respect to the track's meter. This approach seems to yield better results than choosing the frames in a fully random manner, or than taking all frames from the two minutes in the middle of the track.
• From the chosen frames, 25 MFCCs are computed.
• A song is represented as the overall mean of the MFCCs together with the full covariance matrix.

The feature extraction process was implemented using the MA-Toolbox ([6]). Two songs are compared by the Kullback-Leibler (KL) distance (see the second sketch at the end of this section). If the inverse of a song's covariance matrix cannot be found, the song is assumed to be dissimilar to all other songs.

One drawback of this technique is that it does not take the temporal order of frames into consideration; thus, aspects related to time are not modelled. An approach to add time-dependent features is proposed in [2]. However, the version used here is a good starting point for the post-processing step described in the next section.
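The fragment-based frame selection described above can be sketched as follows. This is a minimal illustration in Python/NumPy under the stated parameters (22,050 Hz, 512-sample frames, 20,672 fragments); the function name and structure are assumptions made for illustration, as the actual submission uses the MA-Toolbox:

```python
import numpy as np

SR = 22050          # sampling rate (Hz)
FRAME_LEN = 512     # samples per frame
N_FRAMES = 20672    # number of frames used per track
SKIP = 30 * SR      # disregard first and last 30 seconds

def select_frames(samples: np.ndarray, seed: int = 0) -> np.ndarray:
    """Pick one frame of 512 consecutive samples from a random
    position in each of 20,672 equal-length fragments of the track."""
    rng = np.random.default_rng(seed)
    usable = samples[SKIP:len(samples) - SKIP]
    frag_len = max(len(usable) // N_FRAMES, 1)
    max_start = len(usable) - FRAME_LEN
    frames = np.empty((N_FRAMES, FRAME_LEN))
    for i in range(N_FRAMES):
        # random start inside fragment i; a frame may extend past the
        # fragment boundary, since fragments can be shorter than a frame
        start = min(i * frag_len + rng.integers(0, frag_len), max_start)
        frames[i] = usable[start:start + FRAME_LEN]
    return frames
```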
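Each song is then summarized by the mean and full covariance of its frame-wise MFCCs, i.e. a single multivariate Gaussian, and songs are compared with the symmetrized Kullback-Leibler divergence, for which a closed form exists. The sketch below illustrates this comparison; it is an assumed stand-in for the MA-Toolbox routines actually used, and it assumes the 25 MFCCs per frame have already been computed:

```python
def song_model(mfccs: np.ndarray):
    """mfccs: (n_frames, 25) array of per-frame MFCCs.
    Returns the overall mean vector and full covariance matrix."""
    return mfccs.mean(axis=0), np.cov(mfccs, rowvar=False)

def kl_distance(m1, S1, m2, S2):
    """Symmetrized KL divergence between the Gaussians N(m1, S1)
    and N(m2, S2); the log-determinant terms of the two directed
    divergences cancel in the symmetrized form."""
    try:
        iS1, iS2 = np.linalg.inv(S1), np.linalg.inv(S2)
    except np.linalg.LinAlgError:
        # covariance not invertible: treat the song as dissimilar
        # to all other songs, as described above
        return np.inf
    diff = m1 - m2
    return 0.5 * (np.trace(iS2 @ S1) + np.trace(iS1 @ S2)
                  + diff @ (iS1 + iS2) @ diff) - len(m1)
```

For example, `kl_distance(*song_model(mfccs_a), *song_model(mfccs_b))` yields the kind of pairwise distance that the post-processing step of the next section operates on.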