Target and Non-target Speaker Discrimination by Humans and Machines

The manner in which acoustic features contribute to the perception of speaker identity remains unclear. To better understand speaker perception, we investigated human and machine speaker discrimination using utterances shorter than 2 seconds. Sixty-five listeners performed a same versus different speaker task. Machine performance was estimated with two i-vector/PLDA-based automatic speaker verification systems, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features (VQual2) inspired by a psychoacoustic model of voice quality. Machine performance was measured in terms of the detection cost function and the log-likelihood-ratio cost function. Humans showed higher confidence for correct target decisions than for correct non-target decisions, suggesting that they rely on different features and/or decision-making strategies when identifying a single speaker than when distinguishing between speakers. For non-target trials, responses were highly correlated between humans and the VQual2-based system, especially when speakers were perceptually marked. Fusing human responses with the MFCC-based system improved performance over human-only or MFCC-only results, whereas fusing with the VQual2-based system did not. The study is a step toward understanding human speaker discrimination strategies and suggests that automatic systems can supplement human decisions, especially when speakers are perceptually marked.
