Paper information - Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems

Current speech-input systems typically use a nonspeech threshold for end-of-utterance detection. While usually sufficient for short utterances, this approach can cut speakers off during pauses in more complex utterances. We elicit personal-assistant speech (reminders, calendar entries, messaging, search) using a recognizer with a dramatically increased endpoint threshold, and find frequent nonfinal pauses. A standard endpointer with a 500 ms threshold (latency) results in a 36% cutoff rate for this corpus. Based on the new data, we develop low-cost acoustic features to discriminate nonfinal from final pauses. The features capture periodicity, speaking rate, spectral constancy, duration/intensity, and pitch of prepausal speech, and use no speech recognition, speaker, or session information. Classification experiments yield a 20% equal error rate (EER) at a 100 ms latency, thereby reducing both cutoffs and latency compared with the threshold-only baseline. Additional results on computational cost, feature importance, and speaker differences are discussed.
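The threshold-only baseline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 10 ms frame hop, the function names `threshold_endpoint` and `is_cutoff`, and the toy utterance are all assumptions made for illustration, and the frame-level speech/nonspeech decisions are taken as given rather than computed from audio.

```python
from typing import Optional, Sequence

FRAME_MS = 10  # assumed frame hop; the abstract does not specify one


def threshold_endpoint(is_speech: Sequence[bool], threshold_ms: int = 500,
                       frame_ms: int = FRAME_MS) -> Optional[int]:
    """Return the frame index at which end-of-utterance is declared, or None.

    The endpointer fires once some speech has been seen and `threshold_ms`
    of consecutive nonspeech frames have then accumulated.
    """
    needed = threshold_ms // frame_ms
    seen_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            silent_run = 0
        elif seen_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


def is_cutoff(is_speech: Sequence[bool], endpoint_frame: Optional[int]) -> bool:
    """True if any speech frame follows the declared endpoint."""
    return endpoint_frame is not None and any(is_speech[endpoint_frame + 1:])


# Hypothetical utterance: 1 s speech, 700 ms nonfinal pause,
# 1 s speech, then 1 s of trailing silence.
frames = [True] * 100 + [False] * 70 + [True] * 100 + [False] * 100

for threshold in (500, 1000):
    end = threshold_endpoint(frames, threshold)
    print(threshold, end, is_cutoff(frames, end))
```

In this toy case the 500 ms threshold fires inside the nonfinal pause and cuts the speaker off, while the 1000 ms threshold waits for the true end at the cost of a full second of latency. That trade-off is what motivates classifying pauses as final or nonfinal from prepausal acoustic features instead of relying on pause duration alone.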

Elizabeth Shriberg | Umut Ozertem | Harish Arsikere
