Random forest algorithm for improving the performance of speech/non-speech detection

Speech/non-speech detection (SND) distinguishes between speech and non-speech segments in recorded audio and video documents. SND systems can help reduce the storage space needed when only the speech segments of an audio document are required, for example for content analysis or spoken language identification. In this work, we experimented with time-domain, frequency-domain and cepstral-domain features computed over short frames of 20 ms, along with their mean and standard deviation over segments of 200 ms. We then analysed whether selecting a subset of the features can improve the performance of the SND system. To this end, we experimented with different feature selection algorithms and observed that correlation-based feature selection gave the best results. We also experimented with different decision tree classification algorithms and found that the random forest algorithm outperformed the others. We further improved the performance of the SND system by smoothing the decisions over 5 segments of 200 ms each. Our baseline system uses 272 features and has a classification accuracy of 94.45%; the final system uses 8 features and has a classification accuracy of 97.80%.
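Two steps described above lend themselves to a short illustration: summarising frame-level features by their mean and standard deviation per segment, and smoothing the per-segment decisions over 5 segments. The following is a minimal sketch, not the system evaluated in this work. It assumes non-overlapping 20 ms frames (so 10 frames per 200 ms segment), a population standard deviation, and a centred majority vote as the smoothing rule; the abstract specifies none of these, and the function names `segment_stats` and `smooth_decisions` are hypothetical.

```python
from statistics import mean, pstdev


def segment_stats(frame_features, frames_per_segment=10):
    """Collapse per-frame feature vectors into per-segment [means..., stds...].

    Assumption: frames do not overlap, so ten 20 ms frames span one 200 ms
    segment. Trailing frames that do not fill a whole segment are dropped.
    """
    segments = []
    last_start = len(frame_features) - frames_per_segment
    for start in range(0, last_start + 1, frames_per_segment):
        block = frame_features[start:start + frames_per_segment]
        columns = list(zip(*block))  # one tuple of values per feature
        means = [mean(col) for col in columns]
        stds = [pstdev(col) for col in columns]  # population std (assumed)
        segments.append(means + stds)
    return segments


def smooth_decisions(labels, window=5):
    """Majority vote over a centred window of segment labels (1 = speech).

    Assumption: the smoothing rule is a majority vote. At the edges the
    window is truncated, and a tie falls back to non-speech (0).
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        votes = labels[max(0, i - half):min(len(labels), i + half + 1)]
        smoothed.append(1 if 2 * sum(votes) > len(votes) else 0)
    return smoothed
```

With these assumptions, an isolated non-speech decision surrounded by speech decisions, such as `[1, 1, 1, 0, 1, 1, 1]`, is relabelled as speech. The per-segment vectors returned by `segment_stats` would then be the input to feature selection and to the classifier.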
