A SUPERVISED FACTORIAL ACOUSTIC MODEL FOR SIMULTANEOUS MULTIPARTICIPANT VOCAL ACTIVITY DETECTION IN CLOSE-TALK MICROPHONE RECORDINGS OF MEETINGS

Using automatic speech recognition (ASR) word error rates (WERs) as a metric, the systems in (1) and (3) appear to have yielded similar performance, in spite of significant additional architectural differences. Systems of type (2) have not been fielded for segmentation for ASR, and therefore cannot be directly compared. Although approaches of type (3) offer a significant advantage, namely the opportunity to directly constrain the number of simultaneously vocalizing participants, they come with the caveat of a variable acoustic vector size, since conversations/meetings can have variable numbers of participants. To overcome this difficulty, unsupervised acoustic models have been deployed [4], which do not require acoustic model training data (or training time). Our previous work has shown that this severely limits the number of features, as well as the minimum frame size. The aim of the current work is to develop a supervised acoustic model, capable of producing accurate density estimates for large feature vectors extracted from short frames, for scenario (3).
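The trade-off described above can be illustrated with a minimal sketch. It assumes that approaches of type (3) decode all participants jointly, that a joint state is a tuple of per-participant 0/1 vocal activity flags, and that the acoustic vector concatenates a fixed number of features per close-talk channel. The function names and parameters are hypothetical and are not taken from the systems cited here.

```python
from itertools import combinations


def joint_states(num_participants, max_active):
    """Enumerate joint vocal-activity states as tuples of 0/1 flags.

    Only states with at most `max_active` simultaneously vocalizing
    participants are kept (assumed form of the constraint).
    """
    states = []
    for n_active in range(min(max_active, num_participants) + 1):
        for active in combinations(range(num_participants), n_active):
            # Mark the chosen participants as vocalizing (1), the rest as silent (0).
            state = tuple(1 if k in active else 0 for k in range(num_participants))
            states.append(state)
    return states


def feature_vector_size(num_participants, features_per_channel):
    """Size of an acoustic vector built by concatenating per-channel features
    (an assumed layout, used only to show the dependence on meeting size)."""
    return num_participants * features_per_channel


if __name__ == "__main__":
    for k in (3, 4, 6):
        n_states = len(joint_states(k, max_active=2))
        size = feature_vector_size(k, features_per_channel=5)
        print(f"{k} participants: {n_states} joint states, vector size {size}")
```

Under these assumptions, capping the number of simultaneous talkers keeps the joint state space small, while the acoustic vector size still changes with the number of participants, which is the difficulty the abstract refers to.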

[1] Tanja Schultz, et al. Simultaneous multispeaker segmentation for automatic meeting recognition, 2007, 2007 15th European Signal Processing Conference.

[2] Elizabeth Shriberg, et al. The ICSI Meeting Recorder Dialog Act (MRDA) Corpus, 2004, SIGDIAL Workshop.

[3] Kornel Laskowski, et al. The ISL RT-06S Speech-to-Text System, 2006, MLMI.

[4] Andreas Stolcke, et al. Improved speech activity detection using cross-channel features for recognition of multiparty meetings, 2006, INTERSPEECH.

[5] Guy J. Brown, et al. Speech and crosstalk detection in multichannel audio, 2005, IEEE Transactions on Speech and Audio Processing.

[6] Tanja Schultz, et al. Modeling Vocal Interaction for Segmentation in Meeting Recognition, 2007, MLMI.

[7] Tanja Schultz, et al. A Geometric Interpretation of Non-Target-Normalized Maximum Cross-Channel Correlation for Vocal Activity Detection in Meetings, 2007, HLT-NAACL.

[8] Susanne Burger, et al. The ISL meeting corpus: the impact of meeting type on speech style, 2002, INTERSPEECH.

[9] Elizabeth Shriberg, et al. Overlap in Meetings: ASR Effects and Analysis by Dialog Factors, Speakers, and Collection Site, 2006, MLMI.

[10] Tanja Schultz, et al. Unsupervised Learning of Overlapped Speech Model Parameters For Multichannel Speech Activity Detection in Meetings, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[11] Andreas Stolcke, et al. The ICSI Meeting Corpus, 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

[12] Mary P. Harper, et al. Speech Activity Detection on Multichannels of Meeting Recordings, 2005, MLMI.