Speech Discrimination in Real-World Group Communication Using Audio-Motion Multimodal Sensing

Speech discrimination, which determines whether a participant is speaking at a given moment, is essential for investigating human verbal communication. In dynamic real-world settings where multiple people form groups in the same space, however, simultaneous speakers make speech discrimination based solely on audio sensing difficult. In this study, we focused on physical activity during speech and hypothesized that combining audio and physical motion data acquired by wearable sensors can improve speech discrimination. We therefore recorded utterance and physical activity data from students in a participatory university class using smartphones worn around their necks. First, we examined the temporal relationship between manually identified utterances and physical motions, and confirmed that physical activity across a wide frequency range co-occurred with utterances. Second, we trained and tested classifiers for each participant and found higher performance with the audio-motion classifier (average accuracy 92.2%) than with either the audio-only (80.4%) or the motion-only (87.8%) classifier. Finally, we tested inter-individual classification and again obtained higher performance with the combined audio-motion classifier (83.2%) than with the audio-only (67.7%) or motion-only (71.9%) classifier. These results show that audio-motion multimodal sensing using widely available smartphones can provide effective utterance discrimination in dynamic group communication.
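
To make the fusion idea concrete, the sketch below shows one way frame-level audio and accelerometer features could be concatenated and fed to a single classifier, mirroring the audio-motion combination described above. Everything specific here is an illustrative assumption rather than the study's actual pipeline: the 50 ms frame length, the feature set (RMS energy, zero-crossing rate, acceleration magnitude, and per-axis variance), and the random-forest classifier are all stand-ins, and the toy data is random.

```python
# Minimal sketch of audio-motion fusion for per-frame speech discrimination.
# All design choices (frame length, features, classifier) are assumptions
# for illustration, not the pipeline used in the study.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def audio_features(frames):
    """Per-frame audio features: RMS energy and zero-crossing rate."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([rms, zcr])

def motion_features(frames):
    """Per-frame accelerometer features: mean magnitude and per-axis variance."""
    mag = np.linalg.norm(frames, axis=2)  # (n_frames, n_samples) magnitudes
    return np.column_stack([mag.mean(axis=1), frames.var(axis=1)])

# Toy stand-in data: 1000 frames of 800 audio samples (16 kHz x 50 ms)
# and 5 tri-axial accelerometer samples (100 Hz x 50 ms), with labels.
rng = np.random.default_rng(0)
audio = rng.normal(size=(1000, 800))
accel = rng.normal(size=(1000, 5, 3))
labels = rng.integers(0, 2, size=1000)  # 1 = speaking, 0 = not speaking

# Early fusion: concatenate both modalities into one feature matrix.
X = np.hstack([audio_features(audio), motion_features(accel)])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")
```

Under this framing, the per-participant and inter-individual evaluations reported above would correspond to how the train/test split is constructed: splitting within one participant's recordings versus training on some participants and testing on others.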
