BigSmall: Efficient Multi-Task Learning for Disparate Spatial and Temporal Physiological Measurements

Our understanding of human visual perception has historically inspired the design of computer vision architectures. For example, perception occurs at different scales both spatially and temporally, which suggests that salient visual information can be extracted more effectively by attending to specific features at varying scales. Visual changes in the body caused by physiological processes also occur at different scales and with modality-specific characteristics. Inspired by this, we present BigSmall, an efficient architecture for physiological and behavioral measurement, and the first joint camera-based model for facial action, cardiac, and pulmonary measurement. We propose a multi-branch network with wrapping temporal shift modules that yields gains in both accuracy and efficiency. We observe that fusing low-level features leads to suboptimal performance, whereas fusing high-level features enables efficiency gains with negligible loss in accuracy. Experimental results demonstrate that BigSmall significantly reduces computational cost. Furthermore, compared to existing task-specific models, BigSmall achieves comparable or better results on multiple physiological measurement tasks simultaneously with a single unified model.
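
The abstract does not define the wrapping temporal shift module, so the following is only a minimal sketch under a stated assumption: that "wrapping" means channel groups are shifted along the time axis circularly, so frames pushed off one end of the clip re-enter at the other end rather than being replaced by zeros. The function name `wrapping_temporal_shift`, the `fold_div` parameter, and the list-of-lists data layout are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
def wrapping_temporal_shift(frames, fold_div=3):
    """Circularly shift channel groups along the time axis.

    frames: list of T frames, each a list of C channel values.
    The first C // fold_div channels take their values from the next frame,
    the following C // fold_div channels take theirs from the previous frame,
    and the remaining channels are left unchanged. Indices wrap around the
    clip boundaries (an assumption about what "wrapping" means).
    """
    t = len(frames)
    if t == 0:
        return []
    fold = len(frames[0]) // fold_div
    shifted = []
    for i in range(t):
        prev_frame = frames[(i - 1) % t]  # wraps to the last frame when i == 0
        next_frame = frames[(i + 1) % t]  # wraps to the first frame when i == t - 1
        shifted.append(
            next_frame[:fold]              # group 1: values from the next frame
            + prev_frame[fold:2 * fold]    # group 2: values from the previous frame
            + frames[i][2 * fold:]         # remaining channels: untouched
        )
    return shifted


# Three frames with three channels each; values encode (channel * 10 + time).
clip = [[0, 10, 20], [1, 11, 21], [2, 12, 22]]
result = wrapping_temporal_shift(clip)
```

Because the shift only reindexes existing values, it mixes information across neighboring frames without adding parameters or multiplications, and the wrap-around keeps every channel populated at the clip boundaries.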
