Comparison of Two Speech/Music Segmentation Systems For Audio Indexing on the Web

This article describes two major approaches to the speech/music segmentation task. The first uses a competing-models approach based on classical speech recognition parameters (MFCC). The second uses a class/non-class approach for each of the two main problems: speech/non-speech and music/non-music. To match the characteristics of speech and music closely, different kinds of parameters are used: MFCC and spectral coefficients. We present both approaches together with some intrinsic experiments. We then compare their speech/music discrimination accuracy on a real-world test corpus: a broadcast program containing noisy interviews, superimposed segments (speech over music), and alternating broadband and telephone speech. Within the classical approach, we observe that the first derivative alone, or the second derivative alone, plays a major role in the discrimination process, as does the number of cepstral coefficients. By contrast, the class/non-class approach behaves more homogeneously.
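As background for the finding on cepstral derivatives, the sketch below shows how first-derivative (delta) features are typically computed from a matrix of MFCC frames, using the standard regression formula. The half-window size N, the edge padding, and the function name are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

def delta(cepstra, N=2):
    """Delta (first-derivative) features over time.

    cepstra: (num_frames, num_coeffs) array of cepstral coefficients.
    N: regression half-window (an assumed value; common choices are 2 or 3).
    Applying this function twice yields the second derivative (delta-delta).
    """
    # Repeat the edge frames so boundary deltas are defined.
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(cepstra, dtype=float)
    for t in range(cepstra.shape[0]):
        # Weighted difference of frames ahead and behind frame t.
        d[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return d
```

On a cepstral trajectory that increases linearly over time, the interior delta values come out as the slope, which is the behavior the regression formula is designed to capture.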