Wearable Audio Monitoring: Content-Based Processing Methodology and Implementation

Audio processing tools that extract social-audio features are needed because these features are just as important as conscious content for determining human behavior. Psychologists speculate that these features may have evolved as a way to establish hierarchy and group cohesion, because they function as a subconscious discussion about relationships, resources, risks, and rewards. In this paper, we present the design, implementation, and deployment of a wearable computing platform capable of automatically extracting and analyzing social-audio signals. Unlike conventional research, which concentrates on data recorded under constrained conditions, our data were recorded in completely natural and unpredictable situations. In particular, we benchmarked a set of integrated algorithms (sound and speech detection and classification, sound level meter calculation, voice and nonvoice segmentation, speaker segmentation, and prediction) to obtain social-audio signals from speech and environmental sound using a wearable device built in-house. In addition, we derive a novel method for speaker segmentation and prediction that combines a recently published audio feature extraction technique, power-normalized cepstral coefficients (PNCC), with the gap statistic. The proposed integrated platform performs robustly in natural and unpredictable situations. In experiments, the method segmented natural speech with 89.6% accuracy.
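To make one of the listed stages concrete, the following is a minimal sketch of a frame-wise sound level calculation. It is not the paper's implementation: the function names `frame_levels_db` and `is_active`, the non-overlapping framing, the reference amplitude of 1.0, the -120 dB floor, and the -40 dB activity threshold are all assumptions chosen for illustration, and no frequency or time weighting of the kind a calibrated sound level meter applies is included.

```python
import math


def frame_levels_db(samples, frame_len, ref=1.0, floor_db=-120.0):
    """Return one unweighted RMS level in dB (relative to `ref`) per frame.

    Frames are non-overlapping; a trailing partial frame is dropped.
    All parameters are illustrative assumptions, not values from the paper.
    """
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # A silent frame would require log10(0), so clamp it to the floor.
        level = 20.0 * math.log10(rms / ref) if rms > 0 else floor_db
        levels.append(max(level, floor_db))
    return levels


def is_active(levels, threshold_db=-40.0):
    """Flag frames whose level exceeds a (hypothetical) activity threshold."""
    return [level > threshold_db for level in levels]


if __name__ == "__main__":
    # 80 samples of a full-scale sine (period of 8 samples), then 80 of silence.
    tone = [math.sin(math.pi * n / 4) for n in range(80)]
    signal = tone + [0.0] * 80
    levels = frame_levels_db(signal, frame_len=80)
    print(levels)
    print(is_active(levels))
```

A simple energy threshold like this is only a crude first pass at separating sound from silence; it does not distinguish voice from nonvoice, which the abstract lists as a separate stage.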
