A hybrid barge-in procedure for more reliable turn-taking in human-machine dialog systems

This paper investigates techniques that allow users of human-machine dialog systems to interrupt, or barge in over, machine-generated speech messages. An experimental study was performed on utterances collected from a telephone-based dialog system to analyze the effect of barge-in performance on users' speech. One result of this study was that excessive barge-in latencies resulted in disfluencies appearing in over half of users' utterances. A hybrid procedure for barge-in detection is proposed and evaluated on utterances collected from the same domain. The procedure combines a feature-based voice activity detection (VAD) algorithm with a model-based approach for verifying hypothesized speech segments. It is shown to obtain better detection performance than procedures that rely on the speech recognition decoder to detect speech, and its latencies are comparable to those obtained by low-delay feature-based speech detection algorithms.
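
The two-stage structure described above can be illustrated with a minimal sketch. It is not the procedure evaluated in the paper: the log-energy feature, the single-Gaussian speech and background models, the thresholds, and all function names are assumptions chosen only to show how a feature-based detector can propose segments that a model-based stage then accepts or rejects.

```python
import math


def frame_log_energy(samples, frame_len):
    """Split samples into non-overlapping frames and return the log-energy of each."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return energies


def vad_hypothesize(energies, threshold, min_frames):
    """Stage 1 (feature-based): propose (start, end) frame ranges whose
    log-energy stays above `threshold` for at least `min_frames` frames."""
    segments = []
    run_start = None
    for i, e in enumerate(energies):
        if e > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_frames:
                segments.append((run_start, i))
            run_start = None
    if run_start is not None and len(energies) - run_start >= min_frames:
        segments.append((run_start, len(energies)))
    return segments


def gaussian_log_pdf(x, mean, var):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def verify_segment(energies, segment, speech_model, noise_model, llr_threshold):
    """Stage 2 (model-based): accept the segment if the average per-frame
    log-likelihood ratio of speech versus background exceeds `llr_threshold`."""
    start, end = segment
    frames = energies[start:end]
    llr = sum(
        gaussian_log_pdf(e, *speech_model) - gaussian_log_pdf(e, *noise_model)
        for e in frames
    ) / len(frames)
    return llr > llr_threshold


def detect_barge_in(samples, frame_len=80, vad_threshold=-9.0, min_frames=3,
                    speech_model=(-1.5, 1.0), noise_model=(-10.0, 4.0),
                    llr_threshold=0.0):
    """Run both stages and return the verified (start, end) frame ranges.
    The (mean, variance) model parameters are made-up illustrative values."""
    energies = frame_log_energy(samples, frame_len)
    candidates = vad_hypothesize(energies, vad_threshold, min_frames)
    return [
        seg for seg in candidates
        if verify_segment(energies, seg, speech_model, noise_model, llr_threshold)
    ]
```

In this sketch the first stage is deliberately permissive so that it can respond quickly, and the second stage filters out low-level noise bursts that cross the energy threshold but are poorly explained by the speech model. A deployed system would run both stages incrementally on streaming audio and use richer features and models than the ones assumed here.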