An Improved Speech Segmentation and Clustering Algorithm Based on SOM and K-Means

This paper studies the segmentation and clustering of speech by speaker. To improve the accuracy of speech endpoint detection, a new endpoint detection algorithm is proposed: the short-time average zero-crossing rate used in the traditional double-threshold method is replaced by the spectral centroid feature, and the thresholds are selected from the local maxima of the histogram of the feature sequence. Compared with the traditional double-threshold algorithm, the proposed method improves detection accuracy and noise robustness at low SNR. For clustering, the conventional k-means algorithm requires the number of clusters to be specified in advance and is strongly affected by the choice of initial cluster centers, while the self-organizing neural network algorithm converges slowly and cannot provide accurate clustering information. An improved k-means speaker clustering algorithm based on a self-organizing neural network is therefore proposed. The number of clusters is predicted from which competitive neurons win in the trained network, and the weights of the neurons are used as the initial cluster centers for the k-means algorithm. Experimental results on multi-speaker mixed speech segmentation show that the proposed algorithm improves the accuracy of speech clustering and compensates for the shortcomings of both the k-means algorithm and the self-organizing neural network algorithm.
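
The clustering idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no network topology, learning schedule, or rule for deciding which neurons count as winners. The 1-D map, the linear decay schedules, the `min_win_share` threshold, and the function names `train_som` and `som_kmeans` are all assumptions made for illustration only.

```python
import numpy as np


def train_som(data, n_nodes=6, n_epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small 1-D self-organizing map and return its node weights.

    The topology and decay schedules are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Initialise node weights from randomly chosen samples.
    weights = data[rng.choice(len(data), n_nodes, replace=False)].astype(float)
    positions = np.arange(n_nodes)
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighbourhood
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighbourhood around the winning node on the 1-D map.
            h = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights


def assign(data, centers):
    """Index of the nearest center for every sample."""
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(dists, axis=1)


def som_kmeans(data, n_nodes=6, min_win_share=0.05, n_iter=100):
    """k-means whose cluster count and initial centers come from a trained SOM.

    Nodes that win at least `min_win_share` of the samples are kept; this
    threshold is a hypothetical stand-in for the paper's winning criterion.
    """
    weights = train_som(data, n_nodes=n_nodes)
    wins = np.bincount(assign(data, weights), minlength=n_nodes)
    centers = weights[wins >= min_win_share * len(data)].copy()
    for _ in range(n_iter):
        labels = assign(data, centers)
        new_centers = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(data, centers), centers
```

In this sketch `data` would hold one feature vector per speech segment, and the number of clusters returned is simply the number of SOM nodes that pass the win threshold. The estimate therefore depends on `n_nodes` and `min_win_share`, which would need tuning on real speech features.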
