Speaker Agnostic Foreground Speech Detection from Audio Recordings in Workplace Settings from Wearable Recorders

Audio-signal acquisition as part of wearable sensing adds an important dimension for applications such as understanding human behaviors. As part of a large study on workplace behaviors, we collected audio data from individual hospital staff using custom wearable recorders. The audio features collected were limited in order to preserve the privacy of interactions in the hospital. A first step toward audio processing is to identify the foreground speech of the person wearing the audio badge. This task is challenging because of the multi-party nature of possible ambulatory interactions, the lack of access to speaker information, and varying channel and ambient conditions. In this paper, we present a speaker-agnostic approach to foreground detection. We propose a convolutional neural network model to predict foreground regions using a limited set of audio features. We show that these models generalize across the proxy corpora we collected in house to approximately match the deployment environment. The proxy corpora contained full audio and were used as a test bed to analyze our models in greater detail. We also evaluated the models in the workplace setting to measure speech activity. Our experimental results show a promising direction for analyzing workplace behaviors with privacy-protected sensing.
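Since the abstract describes the approach only at a high level, the following PyTorch sketch is illustrative rather than the authors' implementation. It assumes a small CNN that consumes windows of privacy-limited frame-level features (here, 13 MFCC coefficients over roughly 2-second windows) and emits a per-window foreground/background probability; the class name ForegroundCNN, the feature dimensions, and the layer sizes are all hypothetical.

```python
# Minimal sketch of a speaker-agnostic foreground-speech detector.
# Assumption: 13 MFCCs at a 10 ms hop, so a ~2 s window is ~200 frames;
# the paper's actual features and architecture are not specified here.
import torch
import torch.nn as nn

class ForegroundCNN(nn.Module):  # hypothetical name and architecture
    def __init__(self, n_features: int = 13):
        super().__init__()
        # Treat the (features x time) window as a one-channel "image".
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling: window-length agnostic
        )
        self.fc = nn.Linear(32, 1)  # single logit: foreground vs. background

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_features, n_frames)
        h = self.conv(x).flatten(1)
        return self.fc(h).squeeze(-1)

model = ForegroundCNN()
window = torch.randn(8, 1, 13, 200)             # batch of 2 s feature windows
prob_foreground = torch.sigmoid(model(window))  # per-window probabilities
```

The global average pooling is one plausible design choice in this setting: it keeps the classifier independent of exact window length and limits model capacity, which is in keeping with a detector that must generalize across speakers rather than memorize any one wearer's voice.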
