Evaluating Low-Level Speech Features Against Human Perceptual Data

We introduce a method for measuring the correspondence between low-level speech features and human perception, using a cognitive model of speech perception implemented directly on speech recordings. We evaluate two speaker normalization techniques with this method and find that, in both cases, speaker-normalized features predict human data better than unnormalized features, consistent with previous research. The results also reveal differences between the two normalization methods in how well each predicts human data. This work provides a new framework for evaluating low-level speech representations by how well they match human perception, and lays the groundwork for more ecologically valid models of speech perception.
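The abstract does not name the two normalization techniques it evaluates. Purely as an illustration of what normalizing features across speakers can mean, the sketch below z-scores each feature dimension within each speaker. The function name `normalize_per_speaker`, the toy data, and the choice of z-scoring are all assumptions made for this example, not the paper's method.

```python
import numpy as np

def normalize_per_speaker(features, speaker_ids, eps=1e-8):
    """Z-score every feature dimension separately within each speaker.

    features:    array of shape (n_frames, n_dims)
    speaker_ids: array of shape (n_frames,) giving each frame's speaker label
    eps:         small constant that guards against division by zero
    """
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    normalized = np.empty_like(features)
    for speaker in np.unique(speaker_ids):
        mask = speaker_ids == speaker
        mean = features[mask].mean(axis=0)
        std = features[mask].std(axis=0)
        normalized[mask] = (features[mask] - mean) / (std + eps)
    return normalized

# Hypothetical toy data: two speakers whose features differ in offset and scale.
rng = np.random.default_rng(0)
feats = np.vstack([
    rng.normal(0.0, 1.0, (50, 3)),   # speaker "a"
    rng.normal(5.0, 2.0, (50, 3)),   # speaker "b"
])
spk = np.array(["a"] * 50 + ["b"] * 50)
normed = normalize_per_speaker(feats, spk)
```

After this step each speaker's features have approximately zero mean and unit variance in every dimension, so differences that remain between tokens reflect within-speaker variation rather than speaker-specific offsets or scale.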
