Many onset detection methods have been proposed in recent years. However, few attempts have been made at widely applicable approaches that achieve superior performance across different types of music with considerable temporal precision. This paper concerns the use of the Wavelet Packet Transform to exploit multi-resolution time-frequency features. We apply early fusion in the feature space by combining Wavelet Packet Energy Coefficients and auditory spectral features. The features are then processed by a bidirectional Long Short-Term Memory (BLSTM) recurrent neural network, which acts as a reduction function. The network is trained on a large database of onset data covering various genres and onset types. Because the approach is data-driven, it does not require the onset detection method and its parameters to be tuned to a particular type of music.

This document is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 License. http://creativecommons.org/licenses/by-nc-sa/3.0/ © 2013 The Authors.

1. ALGORITHM DESCRIPTION

The algorithm can be divided into three parts. First, the audio data is transformed into the frequency domain via a Discrete Wavelet Packet Transform (DWPT) with 22 bands (cf. Table 1) and via two parallel STFTs with two different window sizes. Energy-based information and its evolution over time are used as the final feature set. Second, the features are used as inputs to the BLSTM network, which produces an onset activation function as output. Finally, the network output is post-processed by thresholding and peak picking in order to obtain the positions of the onsets. Figure 1 shows this procedure. The individual blocks are described in more detail in the following sections.

1.1 Feature Extraction

Discrete input audio files, sampled at Fs = 44.1 kHz, were used in our experiments.
Figure 1. General block scheme. Blocks: Signal; Auditory Spectral Feat. and WPEC; BLSTM Network; Peak detection; Onsets.

A new feature set is obtained by exploiting the wavelet transform (cf. Figure 2) to compute Wavelet Packet Energy Coefficients (WPEC). The discrete input audio signal is segmented into overlapping frames of W46 = 2048 samples (about 46 ms at Fs = 44.1 kHz), taken at a rate of 100 frames per second, i.e. with a hop of 441 samples. The log-energy of each frame is calculated before applying the Hamming window, following:
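The framing step can be sketched in a few lines of Python. This is a minimal, standard-library-only illustration under stated assumptions, not the authors' implementation: log-energy is assumed to be the natural logarithm of the sum of squared samples plus a small floor `EPS`, the hop of 441 samples is derived from Fs / 100, trailing partial frames are dropped, and the names `frame_signal`, `log_energy` and `preprocess` are hypothetical.

```python
import math

FS = 44100        # sampling rate stated in the text (Hz)
FRAME_LEN = 2048  # frame length W46 stated in the text (samples)
FPS = 100         # frame rate stated in the text (frames per second)
HOP = FS // FPS   # 441 samples between frame starts (derived from Fs / 100)
EPS = 1e-10       # hypothetical floor that avoids log(0) on silent frames


def hamming(n):
    """Standard symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split x into overlapping frames; a trailing partial frame is dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]


def log_energy(frame):
    """Assumed definition: natural log of the sum of squared samples."""
    return math.log(sum(v * v for v in frame) + EPS)


def preprocess(x):
    """Return one (log_energy, windowed_frame) pair per frame.

    The log-energy is computed on the raw frame, before the Hamming
    window is applied, matching the order described in the text.
    """
    window = hamming(FRAME_LEN)
    features = []
    for frame in frame_signal(x):
        energy = log_energy(frame)
        windowed = [v * w for v, w in zip(frame, window)]
        features.append((energy, windowed))
    return features


if __name__ == "__main__":
    # One second of a 440 Hz sine as a stand-in for real audio.
    tone = [math.sin(2.0 * math.pi * 440.0 * n / FS) for n in range(FS)]
    feats = preprocess(tone)
    print(len(feats), "frames; first log-energy:", round(feats[0][0], 3))
```

The windowed frames would then be passed to the DWPT and STFT stages, with the per-frame log-energy kept as an additional feature; how the two are combined in the original system is not specified in this section.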