Computationally-efficient endpointing features for natural spoken interaction with personal-assistant systems

Current speech-input systems typically use a nonspeech-duration threshold for end-of-utterance detection. While usually sufficient for short utterances, this approach can cut speakers off during pauses in more complex utterances. We elicit personal-assistant speech (reminders, calendar entries, messaging, search) using a recognizer with a dramatically increased endpoint threshold, and find frequent nonfinal pauses. A standard endpointer with a 500 ms threshold (latency) produces a 36% cutoff rate on this corpus. Based on the new data, we develop low-cost acoustic features to discriminate nonfinal from final pauses. The features capture the periodicity, speaking rate, spectral constancy, duration/intensity, and pitch of prepausal speech, and require no speech recognition, speaker, or session information. Classification experiments yield a 20% equal error rate (EER) at 100 ms latency, reducing both cutoffs and latency relative to the threshold-only baseline. Additional results on computational cost, feature importance, and speaker differences are discussed.
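
The described approach lends itself to a simple frame-level feature pipeline computed at pause onset. Below is a minimal sketch, assuming 16 kHz mono float audio and illustrative window/context sizes; the specific measures used here (RMS intensity, zero-crossing rate as a crude periodicity proxy, spectral flux as a spectral-constancy proxy) and all names are assumptions for illustration, not the paper's actual feature set or implementation.

```python
import numpy as np

def prepausal_features(speech, sr=16000, frame_ms=25, hop_ms=10, context_ms=200):
    """Summarize the last `context_ms` of speech preceding a detected pause.

    `speech` is a 1-D array of float samples ending at the pause onset.
    Returns a small dict of frame-level statistics, or None if the
    context is shorter than one analysis frame.
    """
    speech = np.asarray(speech, dtype=float)
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    context = speech[-int(sr * context_ms / 1000):]

    # Slice the prepausal context into overlapping analysis frames.
    frames = [context[i:i + frame]
              for i in range(0, len(context) - frame + 1, hop)]
    if not frames:
        return None

    # Intensity: root-mean-square energy per frame.
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    # Crude periodicity proxy: zero-crossing rate (voiced speech tends to have low ZCR).
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f))) > 0) for f in frames])
    # Spectral-constancy proxy: flux between consecutive magnitude spectra.
    spectra = np.array([np.abs(np.fft.rfft(f)) for f in frames])
    flux = (np.sqrt(np.mean(np.diff(spectra, axis=0) ** 2, axis=1))
            if len(frames) > 1 else np.zeros(1))

    return {
        "rms_mean": float(rms.mean()),
        # Falling intensity toward the pause is a common finality cue.
        "rms_slope": (float(np.polyfit(np.arange(len(rms)), rms, 1)[0])
                      if len(rms) > 1 else 0.0),
        "zcr_mean": float(zcr.mean()),
        "spectral_flux_mean": float(flux.mean()),
    }
```

In a deployed endpointer, statistics of this kind would be computed incrementally whenever the silence detector fires and passed to a lightweight classifier, which then decides whether to keep listening or to close the utterance.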
