Enabling Edge Devices that Learn from Each Other: Cross Modal Training for Activity Recognition

Edge devices rely extensively on machine learning for intelligent inference and pattern matching. However, they use a multitude of sensing modalities and are exposed to wide-ranging contexts, and developing a separate machine learning model for each scenario is difficult because manual labeling does not scale. To reduce the amount of labeled data and to speed up training, we propose to transfer knowledge between edge devices using unlabeled data. Our approach, called RecycleML, uses cross-modal transfer to accelerate the learning of edge devices across different sensing modalities. Using human activity recognition as a case study on our collected CMActivity dataset, we observe that RecycleML reduces the amount of required labeled data by at least 90% and speeds up the training process by up to 50 times compared to training the edge device from scratch.
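The gist of cross-modal transfer in this setting is to map different sensing modalities into a shared latent representation, so that task-specific layers trained with labeled data in one modality can be reused ("recycled") by another modality whose encoder is trained only on unlabeled, time-synchronized sensor pairs. The PyTorch sketch below illustrates this idea under stated assumptions: the video-to-IMU pairing, all layer sizes, the MSE alignment loss, and names such as teacher_encoder and unlabeled_pairs are illustrative choices, not the paper's exact architecture.

```python
# Minimal sketch of cross-modal transfer, in the spirit of RecycleML.
# All dimensions, the modality pairing (video -> IMU), and the loss are
# illustrative assumptions; the dummy data stands in for a real loader.
import torch
import torch.nn as nn

LATENT_DIM = 64  # assumed size of the shared latent representation

# Teacher: an already-trained encoder for the source modality (e.g., video
# features), whose latent output feeds a task head trained on labeled data.
teacher_encoder = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))
task_head = nn.Sequential(
    nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 10))

# Student: a fresh encoder for the target modality (e.g., flattened IMU
# windows), which currently has no labeled data of its own.
student_encoder = nn.Sequential(
    nn.Linear(90, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))

# Hypothetical stand-in for unlabeled but time-synchronized modality pairs.
unlabeled_pairs = [(torch.randn(16, 512), torch.randn(16, 90))
                   for _ in range(100)]

# Step 1: align the student's latent space with the teacher's, using only
# the unlabeled pairs; the teacher stays frozen.
teacher_encoder.requires_grad_(False)
optimizer = torch.optim.Adam(student_encoder.parameters(), lr=1e-3)
mse = nn.MSELoss()

for video_batch, imu_batch in unlabeled_pairs:
    with torch.no_grad():
        target = teacher_encoder(video_batch)
    loss = mse(student_encoder(imu_batch), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 2: recycle the teacher's task head on top of the student encoder;
# a small amount of labeled target-modality data can then fine-tune it.
student_model = nn.Sequential(student_encoder, task_head)
```

Because the student is trained only to reproduce the teacher's latent features, the labeled-data-hungry task head never has to be retrained from scratch, which is where the reported savings in labeled data and training time would come from.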
