Innovative Method for Unsupervised Voice Activity Detection and Classification of Audio Segments

An accurate and noise-robust voice activity detection (VAD) system is useful for emerging speech technologies in audio forensics, wireless communication, and speech recognition. However, in real-life applications, a sufficient amount of data, or of human-annotated data, to train such a system may not be available, so a supervised VAD system cannot be used. In this paper, an unsupervised method for VAD is proposed to label the speech-presence and speech-absence segments of an audio recording. To keep the method computationally fast, it uses long-term features computed with the Katz algorithm for fractal dimension estimation. Two databases in different languages are used to evaluate the performance of the proposed method: the Texas Instruments/Massachusetts Institute of Technology (TIMIT) database, which is in English, and the King Saud University (KSU) speech database, which is in Arabic. TIMIT is recorded in only one environment, whereas the KSU speech database is recorded in distinct environments using various recording systems that contain sound cards of different qualities and models. The evaluation suggests that the proposed method labels voiced and unvoiced segments reliably in both clean and noisy audio.
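
As an illustration of the feature named above, the following is a minimal sketch of the Katz fractal dimension computed frame by frame. It uses the standard Katz definition, FD = log10(n) / (log10(n) + log10(d / L)), where L is the total curve length, d is the largest distance from the first point to any other point, and n is the number of steps. The function names `katz_fd` and `frame_fds`, the unit spacing on the time axis, the assumption that samples are normalized to [-1, 1], and the frame length and hop are all assumptions made for this sketch; the abstract does not specify how the paper frames the signal or turns the feature into segment labels.

```python
import math


def katz_fd(frame):
    """Katz fractal dimension of a 1-D sequence of samples.

    Assumption: sample k is treated as the planar point (k, frame[k]),
    with amplitudes normalized to roughly [-1, 1].
    """
    n = len(frame) - 1  # number of steps along the curve
    if n < 2:
        raise ValueError("need at least three samples")
    # L: total curve length, summed over successive points.
    total = sum(math.hypot(1.0, frame[k + 1] - frame[k]) for k in range(n))
    # d: largest distance from the first point to any other point.
    diameter = max(math.hypot(k, frame[k] - frame[0]) for k in range(1, n + 1))
    denom = math.log10(n) + math.log10(diameter / total)
    if denom <= 0:
        # Can happen when amplitude steps are very large relative to d.
        raise ValueError("Katz FD is undefined for this frame")
    return math.log10(n) / denom


def frame_fds(signal, frame_len, hop):
    """Katz FD for each full frame of `signal` (hypothetical helper)."""
    return [
        katz_fd(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


if __name__ == "__main__":
    flat = [0.0] * 5                     # straight line
    zigzag = [0.0, 1.0, 0.0, 1.0, 0.0]   # oscillating waveform
    print(katz_fd(flat))                 # 1.0
    print(katz_fd(zigzag))               # about 1.333
    print(frame_fds(flat + zigzag, frame_len=5, hop=5))
```

A straight line gives a dimension of 1, and more irregular frames give larger values, which is the property that makes the measure a candidate feature for separating segments of different character. How the per-frame values are thresholded or segmented is not stated in the abstract and is left out of the sketch.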
