Prosodic Representations of Prominence Classification Neural Networks and Autoencoders Using Bottleneck Features

Perceived prominence is known to correlate with a complex interplay of acoustic features: energy, fundamental frequency (F0), spectral tilt, and duration. The contribution and importance of each of these features in distinguishing prominent from non-prominent units in speech is not always easy to determine; moreover, the prosodic representations that humans and automatic classifiers learn have been difficult to interpret. This work examines the acoustic prosodic representations that binary prominence classification neural networks and autoencoders learn for prominence. We investigate the complex features learned at different layers of the network, as well as the 10-dimensional bottleneck features (BNFs), for the standard acoustic prosodic correlates of prominence, both separately and in combination. We analyze and visualize the BNFs obtained from the prominence classification neural networks as well as their network activations. The experiments are conducted on a corpus of Dutch continuous speech with manually annotated prominence labels. Our results show that the prosodic representations obtained from the BNFs and from the higher-dimensional non-BNF layers provide good separation of the two prominence categories. However, the BNF space is partitioned differently for the distinct features, and the best overall separation is obtained for F0.
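To make the notion of a bottleneck feature concrete, the following is a minimal sketch of a binary classifier with a 10-dimensional bottleneck layer, from which the BNFs are read out during the forward pass. Only the bottleneck width (10) and the binary output come from the abstract; the input dimensionality, the hidden layer widths, the ReLU activations, the random untrained weights, and the names `init_layer` and `forward` are assumptions made for illustration and are not the authors' architecture.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    # Random weights stand in for trained parameters (assumption).
    w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    return w, np.zeros(n_out)

def forward(x, layers, bottleneck_index):
    """Return (prominence probability, bottleneck activations)."""
    h = x
    bnf = None
    for i, (w, b) in enumerate(layers[:-1]):
        h = np.maximum(h @ w + b, 0.0)  # ReLU hidden activation (assumed)
        if i == bottleneck_index:
            bnf = h  # activations of the narrow layer are the BNFs
    w, b = layers[-1]
    logits = h @ w + b
    prob = 1.0 / (1.0 + np.exp(-logits))  # sigmoid for the binary decision
    return prob.ravel(), bnf

rng = np.random.default_rng(0)
# Hypothetical sizes: 4 input features, two 64-unit layers around the
# 10-unit bottleneck, and one output unit (prominent vs. non-prominent).
sizes = [4, 64, 10, 64, 1]
layers = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((5, 4))  # 5 toy input vectors, not real speech data
prob, bnf = forward(x, layers, bottleneck_index=1)
print(bnf.shape, prob.shape)
```

In a setup of this kind, the `bnf` matrix (one 10-dimensional vector per speech unit) is what would be analyzed or projected to two dimensions for visualization, while activations of the wider layers correspond to the higher-dimensional non-BNF representations.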
