Detection of Questions in Arabic Audio Monologues Using Prosodic Features

Prosody has been widely used in speech-related applications, including speaker and word recognition, emotion and accent identification, topic and sentence segmentation, and text-to-speech synthesis. We investigate one such application: identifying question sentences in Arabic monologue lectures. This problem has received considerable attention for languages other than Arabic. We first segment sentences from the continuous speech using intensity and duration features. Prosodic features are then extracted from each sentence and used as input to decision trees that classify each sentence as either a question or a non-question. Using C4.5 decision trees, we achieved 75.7% accuracy. Our results suggest that questions in natural Arabic speech are cued by more than one type of prosodic feature. Feature-specific analysis further reveals that energy and fundamental frequency features are mainly responsible for discriminating between questions and non-questions.
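
The segmentation step can be illustrated with a minimal sketch. The abstract says only that sentences are segmented using intensity and duration features, so everything specific below is an assumption made for illustration: the function name `segment_by_pauses`, the 40 dB silence threshold, and the minimum pause length of three frames are hypothetical and are not taken from the paper. The idea shown is that a sufficiently long run of low-intensity frames is treated as a sentence boundary, while shorter dips stay inside the current segment.

```python
def segment_by_pauses(intensity_db, silence_db=40.0, min_pause_frames=3):
    """Split a frame-level intensity contour into speech segments.

    A run of at least `min_pause_frames` consecutive frames below
    `silence_db` is treated as a boundary (both values are illustrative
    assumptions). Returns (start, end) frame indices, end exclusive.
    """
    segments = []
    start = None        # first frame of the segment currently open
    last_voiced = None  # last frame at or above the threshold
    quiet_run = 0       # length of the current low-intensity run

    for i, level in enumerate(intensity_db):
        if level >= silence_db:
            if start is None:
                start = i
            last_voiced = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_pause_frames:
                # The pause is long enough: close the open segment.
                segments.append((start, last_voiced + 1))
                start = None
                quiet_run = 0

    # Close a segment that runs to the end of the contour.
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments


# Toy contour: a one-frame dip (index 3) is kept inside the first segment,
# while the three-frame pause (indices 5 to 7) splits the signal.
contour = [30, 60, 62, 35, 61, 30, 30, 30, 58, 59, 30]
print(segment_by_pauses(contour))
```

In a full pipeline, prosodic features such as energy and fundamental frequency statistics would then be computed over each returned span and passed to a classifier; that part is not sketched here because the abstract does not specify the feature set.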
