Modeling Vocal Interaction for Segmentation in Meeting Recognition

Automatic segmentation is an important technology for both automatic speech recognition and automatic speech understanding. In meetings, participants typically vocalize for only a fraction of the recorded time, yet standard vocal activity detection algorithms for close-talk microphones in meetings continue to treat participants independently. In this work we present a multispeaker segmentation system which models a particular aspect of human-human communication: vocal interaction, the interdependence between participants' on-off speech patterns. We describe our vocal interaction model, its training, and its use during vocal activity decoding. Our experiments show that this approach almost completely eliminates the problem of crosstalk, and word error rates on our development set are lower than those obtained with human-generated reference segmentation. We also observe significant performance improvements on unseen data.
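
The abstract describes the model only at a high level. As a rough illustration of the general idea, the sketch below (ours, not the authors' code) trains a first-order transition model over the joint on/off state of all K participants from reference segmentations and applies it during Viterbi decoding of vocal activity. The function names, the additive smoothing, and the conditional-independence treatment of per-channel acoustic scores are our own assumptions for illustration.

```python
import numpy as np

def train_transitions(ref, n_speakers, smoothing=1.0):
    """Estimate joint-state transition probabilities from reference
    segmentations (illustrative sketch, not the paper's estimator).
    `ref` is a (T, K) binary array of frame-level speech/non-speech
    labels, one column per participant; each joint state is the
    integer encoding of one row, so there are 2**K states."""
    n_states = 2 ** n_speakers
    counts = np.full((n_states, n_states), smoothing)  # additive smoothing
    states = ref.dot(1 << np.arange(n_speakers))       # bit-vectors -> ints
    for prev, cur in zip(states[:-1], states[1:]):
        counts[prev, cur] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def decode(frame_loglik, trans):
    """Viterbi decoding over joint vocal-activity states.
    `frame_loglik` is a (T, 2**K) array of per-frame state
    log-likelihoods; here we assume each channel's acoustic score is
    conditionally independent given that speaker's on/off bit, so
    per-channel scores are summed into joint-state scores upstream."""
    T, n_states = frame_loglik.shape
    log_trans = np.log(trans)
    delta = frame_loglik[0].copy()
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans           # (from, to)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + frame_loglik[t]
    # Backtrace the best joint-state path.
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path  # each entry encodes who is speaking at that frame
```

Because the joint state space grows as 2**K, such a direct formulation is tractable only for the small participant counts typical of meetings; larger K would call for pruning or a factored model.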
