A feature selection-based speaker clustering method for paralinguistic tasks

In recent years, computational paralinguistics has emerged as a new topic within speech technology. It concerns extracting non-linguistic information from speech, such as emotions, the level of conflict, or whether the speaker is intoxicated. It was recently shown that many methods applied in this area can benefit from speaker clustering; for example, the features extracted from the utterances can be normalized speaker-wise instead of globally. In this paper, we propose a speaker clustering algorithm that combines standard clustering approaches such as K-means with feature selection. By applying this speaker clustering technique in two paralinguistic tasks, we were able to significantly improve the accuracy scores of several machine learning methods, and we also gained insight into which features can be used efficiently to separate the different speakers.
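To make the idea of speaker-wise normalization via clustering concrete, the following is a minimal sketch, not the algorithm proposed in the paper. It assumes that a feature subset has already been chosen (the `selected` indices are hypothetical), clusters the utterances with a plain K-means on that subset, and then standardizes every feature within each cluster, treating clusters as pseudo-speakers. The function names and the choice of z-score standardization are illustrative assumptions.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct utterances chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance of every utterance to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster became empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def clusterwise_standardize(X, labels, eps=1e-8):
    """Z-normalize every feature separately inside each cluster."""
    out = np.empty_like(X, dtype=float)
    for j in np.unique(labels):
        mask = labels == j
        mu = X[mask].mean(axis=0)
        sd = X[mask].std(axis=0)
        out[mask] = (X[mask] - mu) / (sd + eps)
    return out


def normalize_by_pseudo_speaker(X, selected, k):
    """Cluster on the selected features only, then normalize all features.

    `selected` is a hypothetical list of feature indices assumed to
    separate speakers well; how to find it is outside this sketch.
    """
    labels = kmeans(X[:, selected], k)
    return clusterwise_standardize(X, labels), labels
```

A call such as `normalize_by_pseudo_speaker(X, selected=[0, 1], k=2)` returns the normalized feature matrix together with the cluster labels. With global normalization, the mean and standard deviation would instead be computed once over all rows of `X`.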
