Intoxicated Speech Detection by Fusion of Speaker Normalized Hierarchical Features and GMM Supervectors

Speaker state recognition is a challenging problem because of speaker and context variability. Intoxication detection is an important area of paralinguistic speech research with potential real-world applications. In this work, we build upon a base set of static acoustic features by combining several methods for this learning task: extracting hierarchical acoustic features, performing iterative speaker normalization, and using a set of GMM supervectors. Score-level fusion of these subsystems gives our best unweighted average recall for intoxication recognition: 70.54% on the test set, an improvement of 4.64% absolute (7.04% relative) over the baseline unweighted average recall of 65.9%.

Index Terms: intoxication detection, speaker state, hierarchical features, speaker normalization, GMM supervectors
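To make the reported metric concrete, the following is a minimal sketch of how unweighted average recall and the absolute and relative improvements quoted above can be computed, together with a plain averaging form of score-level fusion. The function names, the toy labels, and the equal-weight averaging are assumptions for illustration only; the abstract does not specify how the subsystem scores are actually combined.

```python
import numpy as np


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally
    regardless of how many examples it has."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


def fuse_scores(subsystem_scores, weights=None):
    """Score-level fusion as a (weighted) average of per-utterance scores.

    subsystem_scores has shape (n_subsystems, n_utterances).
    Equal weights are an assumption, not the method from the paper.
    """
    scores = np.asarray(subsystem_scores, dtype=float)
    return np.average(scores, axis=0, weights=weights)


# Hypothetical labels: 1 = intoxicated, 0 = sober (imbalanced on purpose).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
uar = unweighted_average_recall(y_true, y_pred)

# Hypothetical scores from two subsystems for two utterances.
fused = fuse_scores([[0.2, 0.8], [0.4, 0.6]])

# Improvement figures quoted in the abstract.
baseline_uar = 65.9
fused_uar = 70.54
absolute_gain = fused_uar - baseline_uar
relative_gain = 100.0 * absolute_gain / baseline_uar
```

Unlike plain accuracy, this metric is not inflated by always predicting the majority class, which is why it suits an imbalanced two-class task such as this one.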
