Audio identification based on spectral modeling of bark-bands energy and synchronization through onset detection

In this paper, we present for the first time the IRCAM fingerprint system for audio identification in streams. The baseline system relies on a double-nested Short Time Fourier Transform (STFT). The first STFT computes the energies of a filter bank, which are then modelled over 2 s using a second STFT. We then present two recent improvements to our system: first, the inclusion of perceptual scales for amplitude and frequency (Bark bands); second, the synchronization of stream and database frames using an onset detection system. The performance of these improvements is tested on a large set of real audio streams. We compare our results with those of re-implementations of the two state-of-the-art systems from Philips and Shazam.
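To make the double-nested STFT concrete, the following is a minimal sketch of the idea rather than the system's actual implementation. Only the two-stage structure and the 2 s modelling duration come from the description above. The function names, window and hop sizes, the number of bands, the linearly spaced band edges (a stand-in for the filter bank, not a Bark scale), and the number of retained coefficients are all illustrative assumptions.

```python
import numpy as np

def frame_signal(x, win, hop):
    """Slice a 1-D array into overlapping frames of length `win`."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def band_energies(x, win=1024, hop=256, n_bands=8):
    """First STFT: energy per band and per frame.

    Assumption: bands are equal-width groups of FFT bins, used here
    only as a placeholder for the filter bank.
    """
    frames = frame_signal(x, win, hop) * np.hanning(win)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, win//2 + 1)
    edges = np.linspace(0, power.shape[1], n_bands + 1).astype(int)
    bands = [power[:, a:b].sum(axis=1) for a, b in zip(edges[:-1], edges[1:])]
    return np.stack(bands, axis=1)                    # (n_frames, n_bands)

def nested_stft_fingerprint(x, sr, seg_dur=2.0, seg_hop=0.5, n_coeffs=6,
                            win=1024, hop=256, n_bands=8):
    """Second STFT: model each band's energy trajectory over `seg_dur` seconds.

    Assumption: the first `n_coeffs` magnitude coefficients of each band
    are concatenated into one vector per segment.
    """
    energies = band_energies(x, win, hop, n_bands)
    frame_rate = sr / hop                             # energy frames per second
    seg_len = int(round(seg_dur * frame_rate))
    seg_step = int(round(seg_hop * frame_rate))
    n_segs = 1 + (energies.shape[0] - seg_len) // seg_step
    window = np.hanning(seg_len)[:, None]
    codes = []
    for k in range(n_segs):
        seg = energies[k * seg_step:k * seg_step + seg_len] * window
        spec = np.abs(np.fft.rfft(seg, axis=0))[:n_coeffs]  # (n_coeffs, n_bands)
        codes.append(spec.T.ravel())
    return np.array(codes)                            # (n_segs, n_bands * n_coeffs)

# Usage on 5 s of synthetic noise at an assumed 8 kHz sampling rate.
rng = np.random.default_rng(0)
signal = rng.standard_normal(5 * 8000)
codes = nested_stft_fingerprint(signal, sr=8000)
```

Each row of `codes` summarizes how the band energies evolve over one 2 s segment. Matching a stream against a database would then compare such rows, a step that this sketch leaves out.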