Selecting Groups of Audio Features by Statistical Tests and the Group Lasso

In this paper we aim at discriminating between two musical instruments by means of different groups of audio features, namely the absolute amplitude envelope in the time domain as well as MFCCs, the pitchless periodogram and a simplified spectral envelope in the spectral domain. For this task we utilize common statistical classification algorithms and perform statistical tests to evaluate whether the discriminating power of certain subsets of feature groups dominates that of other subsets. We also examine whether it is possible to directly select a useful set of groups by applying logistic regression regularized by a group lasso penalty structure. Specifically, we apply our methods to a data set of single piano and guitar tones.

1 Description of Features

Each single tone consists of an audio signal $x[n]$, $n \in \{1, \dots, N\}$, of length 1.2 s at sampling rate $sr = 44100$ Hz. From this signal the following four feature vectors are calculated; an illustrative code sketch for each of them follows at the end of this section. Unless stated otherwise, the signal is windowed into half-overlapping segments $w_s$, $s \in \{1, \dots, 25\}$, of 4096 samples each.

1.1 Absolute Amplitude Envelope

To take both the upper and the lower part of the envelope into account, the absolute values $|x[n]|$ define the so-called absolute amplitude envelope $e \in \mathbb{R}^{1 \times 132}$. With $l = \lfloor N/400 \rfloor \cdot 400$,
\[
e = \Bigl( \max_{1 \le i \le 400}\{|x[i]|\},\ \max_{401 \le i \le 800}\{|x[i]|\},\ \dots,\ \max_{l-399 \le i \le l}\{|x[i]|\} \Bigr).
\]
Note that here non-overlapping segments of size 400 are used.

1.2 Pitchless Periodogram

The periodogram $P$ of each window is calculated at the fixed frequencies $\{X_1, \dots, X_{2048}\}$ with $\frac{sr/2}{2048} \le X_j \le \frac{sr}{2}$. Additionally, for each window the fundamental frequency is estimated (called $\hat{f}_0$), so that the overtones can be calculated as $\hat{f}_i = (i+1) \cdot \hat{f}_0$, $i \in \{0, \dots, 13\}$. For each $\hat{f}_i$ and each window $w_s$ the periodogram values, i.e. the squared magnitudes of the DFT, $P^s_{\hat{f}_i}(\bar{X}_i)$ with
\[
|\hat{f}_i - \bar{X}_i| = \min_{1 \le j \le 2048} |\hat{f}_i - X_j| \quad \forall s \in \{1, \dots, 25\},\ \forall i \in \{0, \dots, 13\},
\]
are calculated. Medians of blocks of five subsequent time windows are considered:
\[
p^r_i := \operatorname{median}\bigl( P^r_{\hat{f}_i}(\bar{X}_i), P^{r+1}_{\hat{f}_i}(\bar{X}_i), \dots, P^{r+4}_{\hat{f}_i}(\bar{X}_i) \bigr)
\]
for $i \in \{0, \dots, 13\}$ and $r \in \{1, 6, 11, 16, 21\}$. The pitchless periodogram $v \in \mathbb{R}^{1 \times 70}$ is then defined as
\[
v = \bigl( p^1_0, p^1_1, \dots, p^1_{13},\ p^6_0, \dots, p^6_{13},\ \dots,\ p^{21}_0, \dots, p^{21}_{13} \bigr).
\]
This is called "pitchless" because $v$ is independent of the pitch and of the distances $X_{j+1} - X_j$.

1.3 Mel Frequency Cepstral Coefficients

The power spectrum is calculated by a DFT using Hamming windows and a subsequent log transformation. After mapping the powers of the spectrum onto the mel scale by means of triangular filters, the discrete cosine transform is applied, yielding the MFCCs.

1.4 LPC Simplified Spectral Envelope

For each time window the coefficients of a $p$th-order linear predictor (FIR filter) are calculated, with $p = \lfloor 2 + sr/1000 \rfloor = 46$ (a rule of thumb for formant estimation). The current value of the signal $x^k[n]$ in segment $k$ can thus be estimated from the past samples:
\[
\hat{x}^k(n) = -a_2 x^k(n-1) - a_3 x^k(n-2) - \dots - a_{p+1} x^k(n-p).
\]
The 512-point complex frequency response vector $H^k$ of the filter can be interpreted as the transfer function evaluated at $z = e^{i\omega}$:
\[
H^k(e^{i\omega}) = \Bigl( \sum_{l=1}^{p+1} a_l e^{-i\omega l} \Bigr)^{-1}, \quad k \in \{1, \dots, 25\},\ a_1 = 1,
\]
where the $a_l$ are the linear predictor coefficients. This frequency response is calculated for each time window $k$ and thus yields a matrix $K \in \mathbb{R}^{512 \times 25}$ with $K_{\cdot,j} = 20 \log_{10} |H^j|$, $j \in \{1, \dots, 25\}$. With $r \in \{1, 6, 11, 16, 21\}$ define the elementwise medians
\[
v_r := \operatorname{median}\bigl( K_{\cdot,r}, K_{\cdot,r+1}, K_{\cdot,r+2}, K_{\cdot,r+3}, K_{\cdot,r+4} \bigr),
\]
which yields $V = ( v_1, v_6, v_{11}, v_{16}, v_{21} ) \in \mathbb{R}^{512 \times 5}$. The simplified LPC spectral envelope $s \in \mathbb{R}^{1 \times 125}$ is then obtained by taking the maximum of each subsequent 20 rows of $V$, the last block absorbing the remaining rows so that each of the five columns contributes 25 maxima:
\[
s = \Bigl( \max_{1 \le j \le 20}\{V_{j,1}\},\ \max_{21 \le j \le 40}\{V_{j,1}\},\ \dots,\ \max_{481 \le j \le 512}\{V_{j,1}\},\ \dots,\ \max_{1 \le j \le 20}\{V_{j,21}\},\ \dots,\ \max_{481 \le j \le 512}\{V_{j,21}\} \Bigr).
\]
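The absolute amplitude envelope of Section 1.1 reduces to blockwise maxima of $|x[n]|$. A minimal sketch, assuming the tone is given as a one-dimensional NumPy array:

```python
import numpy as np

def absolute_amplitude_envelope(x, block=400):
    """Maxima of |x[n]| over non-overlapping blocks of 400 samples (Sec. 1.1)."""
    l = (len(x) // block) * block                 # l = floor(N / 400) * 400
    return np.abs(x[:l]).reshape(-1, block).max(axis=1)
```

For a 1.2 s tone at 44100 Hz ($N = 52920$) this returns $\lfloor 52920/400 \rfloor = 132$ values, matching $e \in \mathbb{R}^{1 \times 132}$.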
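The pitchless periodogram of Section 1.2 can be sketched as follows. The fundamental frequency estimator is not specified in this excerpt, so the per-window estimates $\hat{f}_0$ are passed in as an argument:

```python
import numpy as np

def pitchless_periodogram(x, f0_hat, sr=44100, win=4096, hop=2048,
                          n_windows=25, n_partials=14):
    """Periodogram values at the 14 partials of the estimated f0, followed by
    medians over blocks of five consecutive windows (Sec. 1.2)."""
    freqs = np.arange(1, win // 2 + 1) * sr / win     # fixed grid X_1..X_2048
    P = np.empty((n_windows, n_partials))
    for s in range(n_windows):
        seg = x[s * hop : s * hop + win]
        spec = np.abs(np.fft.rfft(seg, n=win)) ** 2   # squared DFT magnitudes
        for i in range(n_partials):
            f_i = (i + 1) * f0_hat[s]                 # overtones (i+1) * f0
            j = np.argmin(np.abs(freqs - f_i))        # closest fixed frequency
            P[s, i] = spec[j + 1]                     # +1 skips the DC bin
    blocks = [np.median(P[r : r + 5], axis=0)         # r = 1, 6, 11, 16, 21
              for r in range(0, n_windows, 5)]
    return np.concatenate(blocks)                     # v with 5 * 14 = 70 entries
```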
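For the MFCCs of Section 1.3, a library routine is the obvious shortcut. The sketch below uses librosa with the window size and hop of the other features; note that librosa applies the logarithm after the mel filter bank, and the number of coefficients (13 here) is a placeholder, as the excerpt does not state it:

```python
import librosa

def mfcc_features(x, sr=44100, n_mfcc=13):
    """MFCCs from a Hamming-windowed DFT power spectrum, mel-scale triangular
    filters and a discrete cosine transform (cf. Sec. 1.3)."""
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=n_mfcc,
                                n_fft=4096, hop_length=2048, window="hamming")
```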
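The simplified LPC spectral envelope of Section 1.4 can be sketched as below. The excerpt does not say how the predictor coefficients are estimated; the sketch assumes the common autocorrelation (Yule-Walker) method:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def lpc_coefficients(seg, p):
    """Autocorrelation-method LPC: solve the Yule-Walker equations for a
    p-th order predictor; returns a = (1, a_2, ..., a_{p+1})."""
    r = np.correlate(seg, seg, mode="full")[len(seg) - 1 :][: p + 1]
    return np.concatenate(([1.0], solve_toeplitz(r[:-1], -r[1:])))

def lpc_spectral_envelope(x, sr=44100, win=4096, hop=2048, n_windows=25):
    p = int(2 + sr / 1000)                        # rule of thumb: p = 46
    K = np.empty((512, n_windows))
    for k in range(n_windows):
        a = lpc_coefficients(x[k * hop : k * hop + win], p)
        _, H = freqz(1.0, a, worN=512)            # all-pole frequency response
        K[:, k] = 20 * np.log10(np.abs(H))        # dB magnitude per window
    # elementwise medians over blocks of five windows -> V (512 x 5)
    V = np.stack([np.median(K[:, r : r + 5], axis=1)
                  for r in range(0, n_windows, 5)], axis=1)
    # blockwise row maxima; the last block absorbs rows 501-512
    edges = list(range(0, 500, 20)) + [512]
    return np.concatenate([[V[lo:hi, c].max()
                            for lo, hi in zip(edges[:-1], edges[1:])]
                           for c in range(V.shape[1])])   # 25 * 5 = 125 values
```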
2 Statistical Modeling and Evaluation

In order to identify which of the above groups are most useful for discriminating between tones of different musical instruments, we do not employ a usual feature selection algorithm. We are not primarily interested in an optimal set of features chosen arbitrarily across all groups, but rather want to evaluate statistically which complete groups are most useful for the classification task at hand. Put differently, we would like to identify a minimal set of groups that classifies optimally. This not only reduces runtime and storage requirements in applications, but also stabilizes the fitting process of the classification models, as the number of features might be quite large compared to the number of observations. We follow a two-fold approach to achieve these objectives.

2.1 Testing Generalization Performance

First, we employ the framework for benchmark experiments by Hothorn et al. [6] to compare the discriminating power of different sets of feature groups. By applying a resampling strategy such as bootstrapping or subsampling, one independently generates training sets from a given data set, uses a classification algorithm to fit models on these, predicts the out-of-bag test samples and measures their performance according to an appropriate loss function. This generates a population of performance values for every classifier, which can then be compared using standard statistical inference methodology. But instead of the usual approach of fixing a certain set of features and then comparing the generalization performance of different kinds of classifiers, we fix the classifier and then vary the sets of features; a code sketch of this procedure is given at the end of this section. We generalize a similar approach that was used for a comparable setting in [14].

2.2 Group Lasso for Logistic Regression

The lasso penalty is a well-known way to directly encode the aim of variable selection into the problem of minimizing the empirical error of a generalized linear predictor:
\[
\min_{\beta_0, \beta} \left( \sum_{i=1}^{n} L\bigl(y_i, \beta^T x_i + \beta_0\bigr) + \lambda \sum_{j=1}^{p} |\beta_j| \right).
\]
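The group lasso in its standard form replaces the $\ell_1$ penalty above with $\lambda \sum_{g=1}^{G} \sqrt{\mathrm{df}_g}\, \|\beta_{(g)}\|_2$, where $\beta_{(g)}$ collects the coefficients of group $g$ and $\mathrm{df}_g$ is the group size; whole coefficient blocks are then shrunk to zero jointly, which performs the desired group selection. A minimal proximal gradient sketch for the logistic loss follows; it illustrates the technique and is not the authors' implementation, and `groups` as well as `lam` are placeholders:

```python
import numpy as np

def group_lasso_logreg(X, y, groups, lam, n_iter=2000, lr=None):
    """Logistic regression with a group lasso penalty, fitted by proximal
    gradient descent; `groups` is a list of column-index arrays and y is
    coded in {0, 1}."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    if lr is None:                          # step size from a Lipschitz bound
        lr = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta + beta0)))   # logistic mean
        beta -= lr * (X.T @ (mu - y) / n)   # gradient step on the smooth loss
        beta0 -= lr * np.mean(mu - y)
        for idx in groups:                  # proximal step: block soft-threshold
            w = lam * lr * np.sqrt(len(idx))
            nrm = np.linalg.norm(beta[idx])
            beta[idx] = 0.0 if nrm <= w else beta[idx] * (1.0 - w / nrm)
    return beta0, beta

# groups whose coefficient block is exactly zero are deselected:
# beta0, beta = group_lasso_logreg(X, y, groups, lam=0.1)
# selected = [g for g, idx in enumerate(groups) if np.linalg.norm(beta[idx]) > 0]
```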
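Finally, the resampling comparison of Section 2.1 can be illustrated with a small sketch. The classifier (linear discriminant analysis), the 2/3 subsampling fraction and the paired Wilcoxon test below are stand-ins, since the excerpt does not name the algorithms and tests actually used; `X`, `y` and the group index sets are placeholders:

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def subsample_errors(X, y, cols, n_rep=100, train_frac=2/3, seed=1):
    """Misclassification rates of one fixed classifier over repeated random
    train/test splits, restricted to the feature columns in `cols`."""
    rng = np.random.default_rng(seed)
    n, errs = len(y), np.empty(n_rep)
    for b in range(n_rep):
        train = rng.choice(n, size=int(train_frac * n), replace=False)
        test = np.setdiff1d(np.arange(n), train)   # out-of-bag samples
        clf = LinearDiscriminantAnalysis().fit(X[np.ix_(train, cols)], y[train])
        errs[b] = np.mean(clf.predict(X[np.ix_(test, cols)]) != y[test])
    return errs

# groups: dict mapping feature-group names to column index lists (hypothetical)
# err_a = subsample_errors(X, y, groups["mfcc"])
# err_b = subsample_errors(X, y, groups["mfcc"] + groups["envelope"])
# wilcoxon(err_a, err_b)   # paired comparison on identical splits (same seed)
```

Fixing the seed makes the two error populations paired, so a paired test can be used to decide whether adding a feature group significantly changes the generalization error.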