A measure of phonetic similarity to quantify pronunciation variation by using ASR technology

How to define a quantitative measure of phonetic similarity between IPA transcripts of the same sentence read by two speakers has attracted researchers' interest. The problem can be divided into two parts: how to align the two transcripts and how to quantify the alignment gap. In this paper, we introduce a method of similarity calculation that uses phone-based or phoneme-based acoustic models trained with the algorithms used to develop Automatic Speech Recognition (ASR) systems. Using acoustic models raises the issue of speaker dependency, because speech spectra always carry information about the training speakers' age and gender, which is irrelevant to phonetic similarity calculation. We examine how independent our method is of the training speakers and how closely the calculated similarity matches the similarity rated subjectively in a listening test. We also compare our method with recent work and show that it yields a correlation with human-rated similarity that is higher by 4 points.
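The two-part framing above (align the transcripts, then quantify the gap) can be illustrated with a weighted edit distance over phone symbols. The following is only a minimal sketch under stated assumptions, not the method of this paper: the function names, the insertion/deletion cost, and the `TOY_DISTANCES` table are hypothetical placeholders. In an approach like the one described here, the substitution costs would presumably be derived from trained acoustic models rather than written by hand.

```python
from typing import Callable, List, Sequence


def weighted_alignment_cost(
    a: Sequence[str],
    b: Sequence[str],
    sub_cost: Callable[[str, str], float],
    indel_cost: float = 1.0,
) -> float:
    """Minimum total cost of aligning phone sequences a and b.

    Standard dynamic-programming edit distance in which the
    substitution cost depends on the pair of phones involved.
    """
    n, m = len(a), len(b)
    d: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost  # delete every phone of a
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost  # insert every phone of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost,                        # deletion
                d[i][j - 1] + indel_cost,                        # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution or match
            )
    return d[n][m]


# Hypothetical phone-to-phone distances, invented for illustration only.
TOY_DISTANCES = {
    frozenset(("ɪ", "i")): 0.2,
    frozenset(("s", "z")): 0.3,
}


def toy_sub_cost(p: str, q: str) -> float:
    """Return 0 for identical phones, a table value if listed, else 1."""
    if p == q:
        return 0.0
    return TOY_DISTANCES.get(frozenset((p, q)), 1.0)


if __name__ == "__main__":
    speaker_1 = ["s", "ɪ", "t"]
    speaker_2 = ["z", "i", "t"]
    print(weighted_alignment_cost(speaker_1, speaker_2, toy_sub_cost))
```

With a constant substitution cost of 1 this reduces to the plain Levenshtein distance; the graded costs are what let acoustically close phones count as a smaller gap than distant ones.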
