Audio-Based Activities of Daily Living (ADL) Recognition with Large-Scale Acoustic Embeddings from Online Videos

Activity sensing and recognition have been shown to play a key enabling role in a wide range of applications, from sustainability and human-computer interaction to health care. While many recognition tasks have traditionally relied on inertial sensors, acoustic-based methods capture rich contextual information that can help discriminate complex activities. With the emergence of deep learning techniques and new, large-scale multimedia datasets, this paper revisits the opportunity of training audio-based classifiers without the onerous and time-consuming task of annotating audio data. We propose a framework for audio-based activity recognition that can make use of millions of embedding features from public online video sound clips. By combining oversampling and deep learning approaches, our framework avoids the additional feature processing and outlier filtering required in prior work. We evaluated our approach in the context of Activities of Daily Living (ADL) by recognizing 15 everyday activities with 14 participants in their own homes, achieving averaged within-subject accuracies of 64.2% (top-1) and 83.6% (top-3). We also examine individual class performance to further study the co-occurrence characteristics of the activities and the robustness of the framework.
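The top-1 and top-3 figures above can be read as follows: a prediction counts as correct under top-k if the true activity is among the k classes the classifier scores highest, and the per-participant accuracies are then averaged. The sketch below illustrates that reading with NumPy. It is not the paper's evaluation code; the function names and the toy scores are assumptions made for illustration only.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Column indices of the k largest scores in each row (order within the k is irrelevant).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


def averaged_within_subject_accuracy(scores_by_subject, labels_by_subject, k):
    """Compute top-k accuracy separately per participant, then average across participants."""
    per_subject = [
        top_k_accuracy(s, y, k)
        for s, y in zip(scores_by_subject, labels_by_subject)
    ]
    return float(np.mean(per_subject))


# Toy example (hypothetical values): 3 clips, 3 activity classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_labels = [1, 1, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

Top-k accuracy can only stay the same or rise as k grows, which is consistent with the gap between the reported top-1 and top-3 results for activities that tend to co-occur.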
