Exploring the relationship between voice similarity estimates by listeners and by an automatic speaker recognition system incorporating phonetic features

Abstract: The present study investigates relationships between voice similarity ratings made by human listeners and comparison scores produced by an automatic speaker recognition system that includes phonetic, perceptually relevant features in its modelling. The study analyses human voice similarity ratings of pairs of speech samples from unrelated speakers from an accent-controlled database (DyViS, Standard Southern British English) and the comparison scores from an i-vector-based automatic speaker recognition system using ‘auto-phonetic’ (automatically extracted phonetic) features. The voice similarity ratings were obtained from 106 listeners, each of whom rated the voice similarity of pairings of ten speakers on a Likert scale via an online test. Correlation analysis and Multidimensional Scaling showed a positive relationship between listeners’ judgements and the automatic comparison scores. A separate analysis of the subsets of listener responses from English and German native speaker groups showed that a positive relationship was present for both groups, but that the correlation was higher for the English listener group. This work has key implications for forensic phonetics: it highlights the potential to automate part of the process of selecting foil voices in voice parade construction, which currently requires the collection and processing of human judgements. Further, establishing that automatic voice comparisons using phonetic features can be used to select similar-sounding voices has important applications in ‘voice casting’ (finding voices that are similar to a given voice) and ‘voice banking’ (saving one’s voice for future synthesis in case of an operation or degenerative disease).
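The correlation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis: the data are synthetic, the choice of Spearman rank correlation is an assumption (the abstract says only "correlation analysis"), and the 9-point scale, the 20 ratings per pair, the use of the 45 unordered pairs of ten speakers, and all names (`listener_ratings`, `auto_scores`, `fake_rating`) are hypothetical. The idea is to average the listener ratings for each speaker pair and correlate those means with the automatic comparison scores for the same pairs.

```python
from itertools import combinations
from statistics import mean
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Synthetic stand-in data (NOT from the study).
speakers = [f"spk{i:02d}" for i in range(1, 11)]   # ten hypothetical speakers
pairs = list(combinations(speakers, 2))            # 45 unordered speaker pairs

rng = random.Random(0)
# Hypothetical automatic comparison score for each pair.
auto_scores = {pair: rng.gauss(0.0, 1.0) for pair in pairs}


def fake_rating(score):
    """Simulate one noisy listener rating on an assumed 1-9 scale."""
    return min(9, max(1, round(5 + 1.5 * score + rng.gauss(0.0, 1.5))))


# 20 simulated ratings per pair (an arbitrary number for illustration).
listener_ratings = {
    pair: [fake_rating(auto_scores[pair]) for _ in range(20)] for pair in pairs
}

mean_ratings = [mean(listener_ratings[p]) for p in pairs]
scores = [auto_scores[p] for p in pairs]
rho = spearman(mean_ratings, scores)
print(f"{len(pairs)} pairs, Spearman rho = {rho:.2f}")
```

Because the simulated ratings are generated from the simulated scores, the sketch yields a positive correlation by construction; with real data, the same computation could be repeated separately for each listener group to compare, for example, English and German native speakers.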
