A Framework for Evaluating Speech Representations

Listeners track distributions of speech sounds along perceptual dimensions. We introduce a method for evaluating hypotheses about what those dimensions are, using a cognitive model whose prior distribution is estimated directly from speech recordings. We use this method to evaluate two speaker normalization algorithms against human data. Simulations show that representations that are normalized across speakers predict human discrimination data better than unnormalized representations, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a framework for evaluating hypothesized representations of speech and lays the groundwork for testing models of speech perception on natural speech recordings from ecologically valid settings.
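
To make the contrast between unnormalized and speaker-normalized representations concrete, the sketch below applies per-speaker z-scoring to formant measurements. This is only an illustration: the abstract does not name the two normalization algorithms evaluated, and per-speaker z-scoring is assumed here simply because it is a widely used form of vowel normalization. The function name `zscore_by_speaker`, the token layout, and the formant values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_speaker(tokens):
    """Normalize formants within each speaker (illustrative assumption,
    not necessarily one of the algorithms evaluated in the paper).

    tokens: list of (speaker_id, f1_hz, f2_hz) tuples.
    Returns a list of (speaker_id, z_f1, z_f2) in the same order.
    """
    # Group raw formant values by speaker.
    by_speaker = defaultdict(list)
    for speaker, f1, f2 in tokens:
        by_speaker[speaker].append((f1, f2))

    # Per-speaker mean and population standard deviation for each formant.
    stats = {}
    for speaker, values in by_speaker.items():
        f1s = [v[0] for v in values]
        f2s = [v[1] for v in values]
        stats[speaker] = (mean(f1s), pstdev(f1s), mean(f2s), pstdev(f2s))

    def z(value, mu, sd):
        # A speaker with no variation on a dimension maps to 0.0.
        return 0.0 if sd == 0 else (value - mu) / sd

    normalized = []
    for speaker, f1, f2 in tokens:
        m1, s1, m2, s2 = stats[speaker]
        normalized.append((speaker, z(f1, m1, s1), z(f2, m2, s2)))
    return normalized


# Hypothetical measurements: two speakers with different formant ranges.
tokens = [
    ("A", 300.0, 2200.0),
    ("A", 700.0, 1200.0),
    ("B", 400.0, 2800.0),
    ("B", 900.0, 1500.0),
]
normalized = zscore_by_speaker(tokens)
```

In the raw values the two speakers occupy different regions of formant space; after z-scoring, corresponding tokens from each speaker land on the same coordinates, which is the kind of cross-speaker alignment a normalized representation is meant to provide.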
