Localising speech, footsteps and other sounds using resource-constrained devices

While a number of acoustic localisation systems have been proposed over the last few decades, these have typically either relied on expensive dedicated microphone arrays and workstation-class processing, or have been developed to detect a very specific type of sound in a particular scenario. However, as people live and work indoors, they generate a wide variety of sounds as they interact and move about. These human-generated sounds can be used to infer the positions of people, without requiring them to wear trackable tags. In this paper, we take a practical yet general approach to localising a number of human-generated sounds. Drawing from signal processing literature, we identify methods for resource-constrained devices in a sensor network to detect, classify and locate acoustic events such as speech, footsteps and objects being placed onto tables. We evaluate the classification and time-of-arrival estimation algorithms using a data set of human-generated sounds we captured with sensor nodes in a controlled setting. We show that despite the variety and complexity of the sounds, their localisation is feasible for sensor networks, with typical accuracies of a half metre or better. We specifically discuss the processing and networking considerations, and explore the performance trade-offs which can be made to further conserve resources.

[1]  Sharon Gannot,et al.  Time difference of arrival estimation of speech source in a noisy and reverberant environment , 2005, Signal Process..

[2]  T. Ajdler,et al.  Acoustic source localization in distributed sensor networks , 2004, Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, 2004..

[3]  Peter Kabal,et al.  Speech/music discrimination for multimedia applications , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[4]  Bruce H. Krogh,et al.  Lightweight detection and classification for wireless sensor networks in realistic environments , 2005, SenSys '05.

[5]  Jacob Benesty,et al.  Time Delay Estimation in Room Acoustic Environments: An Overview , 2006, EURASIP J. Adv. Signal Process..

[6]  M S Brandstein Time-delay estimation of reverberated speech exploiting harmonic structure. , 1999, The Journal of the Acoustical Society of America.

[7]  Stan Z. Li,et al.  Content-based audio classification and retrieval using the nearest feature line method , 2000, IEEE Trans. Speech Audio Process..

[8]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[9]  G. Carter,et al.  The generalized correlation method for estimation of time delay , 1976 .

[10]  Jeroen Breebaart,et al.  Features for Audio Classification , 2004 .

[11]  Jhing-Fa Wang,et al.  Chip design of MFCC extraction for speech recognition , 2002, Integr..

[12]  Richard S. Goldhor,et al.  Recognition of environmental sounds , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[14]  Mohan S. Kankanhalli,et al.  Audio Based Event Detection for Multimedia Surveillance , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[15]  R. E. Hudson,et al.  Acoustic sensor networks for woodpecker localization , 2005, SPIE Optics + Photonics.

[16]  Kung Yao,et al.  An Empirical Study of Collaborative Acoustic Source Localization , 2007, 2007 6th International Symposium on Information Processing in Sensor Networks.

[17]  Wei Pan,et al.  SoundSense: scalable sound sensing for people-centric applications on mobile phones , 2009, MobiSys '09.

[18]  W. Press,et al.  Numerical Recipes in Fortran: The Art of Scientific Computing.@@@Numerical Recipes in C: The Art of Scientific Computing. , 1994 .

[19]  James Scott,et al.  Audio Location: Accurate Low-Cost Location Sensing , 2005, Pervasive.

[20]  Sadaoki Furui,et al.  Advances in Speech Signal Processing , 1991 .

[21]  Tapio Lokki,et al.  AN EYES-FREE USER INTERFACE CONTROLLED BY FINGER SNAPS , 2005 .

[22]  R. J. Martin,et al.  Autoregressive modelling in vector spaces: An application to narrow-bandwidth spectral estimation , 1996, Signal Process..

[23]  Sanjay Jha,et al.  The design and evaluation of a hybrid sensor network for Cane-Toad monitoring , 2005 .

[24]  Jonathan G. Fiscus,et al.  Darpa Timit Acoustic-Phonetic Continuous Speech Corpus CD-ROM {TIMIT} | NIST , 1993 .

[25]  John Saunders,et al.  Real-time discrimination of broadcast speech/music , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[26]  Jacob Benesty,et al.  Performance of GCC- and AMDF-Based Time-Delay Estimation in Practical Reverberant Environments , 2005, EURASIP J. Adv. Signal Process..

[27]  Deborah Estrin,et al.  The design and implementation of a self-calibrating distributed acoustic sensing platform , 2006, SenSys '06.

[28]  Chloé Clavel,et al.  Events Detection for an Audio-Based Surveillance System , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[29]  Lie Lu,et al.  Content analysis for audio classification and segmentation , 2002, IEEE Trans. Speech Audio Process..

[30]  Malcolm Slaney,et al.  Construction and evaluation of a robust multifeature speech/music discriminator , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[31]  James M. Rehg,et al.  Using Sound Source Localization in a Home Environment , 2005, Pervasive.

[32]  Roberto Cusani,et al.  Performance of fast time delay estimators , 1989, IEEE Trans. Acoust. Speech Signal Process..

[33]  Benoît Champagne,et al.  Performance of time-delay estimation in the presence of room reverberation , 1996, IEEE Trans. Speech Audio Process..

[34]  Michael S. Brandstein,et al.  A practical methodology for speech source localization with microphone arrays , 1997, Comput. Speech Lang..

[35]  Janto Skowronek,et al.  Automatic surveillance of the acoustic activity in our living environment , 2005, 2005 IEEE International Conference on Multimedia and Expo.

[36]  Augusto Sarti,et al.  Scream and gunshot detection and localization for audio-surveillance systems , 2007, 2007 IEEE Conference on Advanced Video and Signal Based Surveillance.