Prosodic stress detection for fixed stress languages using formal atom decomposition and a statistical hidden Markov hybrid

Abstract The detection of prosodic events and prosodic stress, and speech segmentation based on prosody, have received much attention in the research community over the past decades. Prosody is relevant for both main areas of speech technology, text-to-speech synthesis and automatic speech recognition and understanding, and is increasingly exploited: besides providing redundancy, prosody is recognized to carry information unavailable from other sources, and it also contributes to the naturalness of the perceived speech. This paper addresses a recently proposed intonation analysis technique called Weighted Correlation based Atom Decomposition (WCAD). The WCAD approach is inspired by the physiology of speech production and by the Fujisaki model used in speech synthesis; however, it is employed analytically rather than generatively: the intonation contour is decomposed into a set of elementary components, called atoms, by a pattern matching algorithm. The resulting atom decomposition is used for prosodic stress detection and automatic phonological phrasing. We compare the WCAD approach with a phonological approach, and also combine the two; the phonological approach relies on automatic segmentation into phonological phrases using a Gaussian Mixture Model (GMM) / Hidden Markov Model (HMM) system and Viterbi alignment. Results show that the physiologically inspired system performs comparably to the phonologically conceived one in phonological phrasing for two fixed-stress languages from different language families: Hungarian and French. With this we also intend to confirm experimentally that the physiologically inspired WCAD model is able to predict or extract linguistically relevant markers linked to meaning. Finally, a hybrid model is proposed that combines the physiologically and the phonologically inspired approaches, and it is evaluated in phonological phrase and prosodic stress detection in both languages. The hybrid model is found to outperform both individual systems.
The basic algorithmic steps for feature extraction and atom decomposition are, as a whole, applicable to a wide range of languages. However, linking their output to linguistic levels and meaning is by nature language-specific: which event corresponds to which linguistic cue or function cannot be determined without knowing the language.
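
To make the idea of decomposing a contour into atoms by pattern matching more concrete, the following is a minimal sketch of plain matching pursuit over a small dictionary of gamma-shaped kernels. It is an illustration under stated assumptions, not the WCAD algorithm itself: the weighted correlation that gives WCAD its name is replaced here by an ordinary inner product, and the kernel shape, the parameters `k`, `theta` and `length`, and the function names `gamma_atom` and `matching_pursuit` are all hypothetical choices made for this example.

```python
import numpy as np

def gamma_atom(theta, k=6, length=100):
    """Unit-norm gamma-shaped kernel (illustrative parameters, not from the paper)."""
    t = np.arange(1, length + 1, dtype=float)
    g = t ** (k - 1) * np.exp(-t / theta)
    return g / np.linalg.norm(g)

def matching_pursuit(signal, dictionary, n_atoms=5):
    """Greedy decomposition of `signal` into shifted, scaled kernels.

    Assumes every kernel has unit norm and is no longer than the signal.
    Returns a list of (kernel_index, position, amplitude) and the final residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    atoms = []
    for _ in range(n_atoms):
        best = None
        for idx, kernel in enumerate(dictionary):
            # Inner product of the residual with the kernel at every shift.
            corr = np.correlate(residual, kernel, mode="valid")
            pos = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[pos]) > abs(best[2]):
                best = (idx, pos, corr[pos])
        idx, pos, amp = best
        kernel = dictionary[idx]
        # Subtract the best-matching atom; for a unit-norm kernel the
        # inner product is the least-squares amplitude.
        residual[pos:pos + len(kernel)] -= amp * kernel
        atoms.append((idx, pos, float(amp)))
    return atoms, residual
```

Each returned triple gives the kernel used, its onset in samples, and its signed amplitude. In a setting like the one the abstract describes, the positions and amplitudes of such atoms would be the quantities later related to stress and phrasing, but how that mapping is done is language-specific and is not shown here.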
