Novel algorithm for speech segregation by optimized k-means of statistical properties of clustered features

To simplify speaker diarization and speech separation, the speech signal should first be segregated into two speech formats: dialog and mixture. This paper describes a new algorithm that performs this first step efficiently. The algorithm combines Perceptual Linear Predictive (PLP) feature extraction, an optimized k-means, and both top-down and bottom-up scenarios. After features are extracted from the observed signal, k-means clusters statistical properties, such as the variances, of the PDF (histogram) of the clustered features. The k-means is optimized by running the procedure twice and discounting the worst clustering pattern. A feedback loop, which exploits the attributes of ordinary k-means, is needed to guide the optimized k-means. The segregation results are excellent, and the diarization error rate (DER) computed on the outputs is very low.
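The clustering stage described above can be sketched as follows. This is a minimal, hypothetical Python illustration, not the paper's implementation. It assumes PLP features have already been extracted and reduced to one scalar value per frame, with frames grouped into segments. Each segment is summarized by the variance of its histogram-estimated PDF. "Discounting the worst pattern" is read here as running k-means twice and keeping the run with the lower inertia. The names `kmeans`, `optimized_kmeans`, `histogram_variance`, and `segregate`, as well as the bin count and seeds, are illustrative assumptions. The feedback loop and the top-down/bottom-up scenarios are not modeled.

```python
import random


def _sqdist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed, iters=100):
    """Plain Lloyd's k-means; returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: _sqdist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: summed squared distance of points to their assigned centroids.
    inertia = sum(_sqdist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


def optimized_kmeans(points, k, seeds=(0, 1)):
    """Run k-means once per seed (twice by default) and keep the best run.

    ASSUMPTION: "discounting the worst pattern" is modeled as discarding
    the run with the higher inertia; the paper's criterion may differ.
    """
    runs = [kmeans(points, k, seed) for seed in seeds]
    return min(runs, key=lambda run: run[2])


def histogram_variance(values, bins=10):
    """Variance of the PDF estimated by a histogram of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant input
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    probs = [c / len(values) for c in counts]
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    mean = sum(p * c for p, c in zip(probs, centers))
    return sum(p * (c - mean) ** 2 for p, c in zip(probs, centers))


def segregate(segments, bins=10):
    """Label each segment 0 or 1 by optimized 2-means on its histogram variance.

    ASSUMPTION: `segments` is a list of lists of scalar feature values
    (e.g. one PLP coefficient per frame); PLP extraction is not shown.
    """
    stats = [(histogram_variance(seg, bins),) for seg in segments]
    labels, _, _ = optimized_kmeans(stats, k=2)
    return labels
```

The sketch does not decide which of the two output labels corresponds to dialog and which to mixture; that mapping, like the feedback loop, would have to come from the method itself.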
