Human activity classification based on sound recognition and residual convolutional neural network

Abstract Human activity recognition is crucial for better understanding workers on construction sites and people in the built environment. Previous studies have proposed various ways to collect human activity data automatically using sensing and machine learning techniques. Sound recognition can compensate for the limitations of those methods: sound signals propagate readily in indoor environments with many physical obstacles, and sound recognition can pick up not only sounds produced by human activities but also sounds from related objects. This study therefore develops a sound recognition-based human activity classification model using a residual neural network. Sound data were collected for ten classes representing people's daily activities in indoor environments. Features were then extracted from the sound data using the log Mel-filter bank energies method, and a residual neural network with 34 convolutional layers was trained on them. The model achieved an accuracy of 87.6%; across classes, precision ranged from 76.8% to 92.6%, recall from 75.8% to 98.6%, and F1-score from 78.6% to 93.7%. The contribution of this study is to demonstrate that sound recognition can successfully classify people's indoor activities. Its limitation is that it relies on a monophonic method, in which only one activity can be classified at a time.
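The log Mel-filter bank energies step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sampling rate (16 kHz), frame length (512 samples), hop size (160 samples), number of mel bands (40), Hann window, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def hz_to_mel(hz):
    # Common (HTK-style) mel-scale formula.
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def log_mel_energies(signal, sample_rate=16000, n_fft=512, hop=160, n_mels=40, eps=1e-10):
    # Assumed (hypothetical) framing parameters; the paper's values are not given here.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    # Power spectrum of each windowed frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # Sum the power falling into each mel band, then take the log.
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + eps)

# Usage on a synthetic 1-second, 440 Hz tone (stand-in for a recorded clip).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
features = log_mel_energies(np.sin(2 * np.pi * 440 * t), sample_rate=sample_rate)
print(features.shape)  # (frames, mel bands)
```

With these assumed settings, a one-second clip yields a 97 × 40 time-frequency array, the kind of image-like input that a convolutional network such as a 34-layer residual network can be trained on.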
