Disjoint Multi-task Learning Between Heterogeneous Human-Centric Tasks

Human behavior understanding is arguably one of the most important mid-level components in artificial intelligence. In order to efficiently make use of data, multi-task learning has been studied in diverse computer vision tasks including human behavior understanding. However, multitask learning relies on task specific datasets and constructing such datasets can be cumbersome. It requires huge amounts of data, labeling efforts, statistical consideration etc. In this paper, we leverage existing single-task datasets for human action classification and captioning data for efficient human behavior learning. Since the data in each dataset has respective heterogeneous annotations, traditional multi-task learning is not effective in this scenario. To this end, we propose a novel alternating directional optimization method to efficiently learn from the heterogeneous data. We demonstrate the effectiveness of our model and show performance improvements on both classification and sentence retrieval tasks in comparison to the models trained on each of the single-task datasets.

[1]  Rich Caruana,et al.  Multitask Learning , 1997, Machine-mediated learning.

[2]  C. Heyes,et al.  Motor learning by observation: Evidence from a serial reaction time task , 2002, The Quarterly journal of experimental psychology. A, Human experimental psychology.

[3]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Brett Browning,et al.  A survey of robot learning from demonstration , 2009, Robotics Auton. Syst..

[5]  Ivan Laptev,et al.  Recognizing human actions in still images: a study of bag-of-features and part-based representations , 2010, BMVC.

[6]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Cyrus Rashtchian,et al.  Collecting Image Annotations Using Amazon’s Mechanical Turk , 2010, Mturk@HLT-NAACL.

[8]  William B. Dolan,et al.  Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.

[9]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[10]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[11]  Narendra Ahuja,et al.  Robust visual tracking via multi-task sparse learning , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[13]  Qiang Yang,et al.  Lifelong Machine Learning Systems: Beyond Learning Algorithms , 2013, AAAI Spring Symposium: Lifelong Machine Learning.

[14]  Joshua B. Tenenbaum,et al.  Learning with Hierarchical-Deep Models , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[16]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[17]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[18]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Xiaoou Tang,et al.  Facial Landmark Detection by Deep Multi-task Learning , 2014, ECCV.

[21]  Ruslan Salakhutdinov,et al.  Multimodal Neural Language Models , 2014, ICML.

[22]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[23]  Zhe Wang,et al.  Towards Good Practices for Very Deep Two-Stream ConvNets , 2015, ArXiv.

[24]  Jieping Ye,et al.  Safe Screening for Multi-Task Feature Learning with Multiple Data Matrices , 2015, ICML.

[25]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[26]  Licheng Yu,et al.  Visual Madlibs: Fill in the blank Image Generation and Question Answering , 2015, ArXiv.

[27]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[28]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[31]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Chunhua Shen,et al.  What Value Do Explicit High Level Concepts Have in Vision to Language Problems? , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Andrea Vedaldi,et al.  Integrated perception with recurrent multi-task neural networks , 2016, NIPS.

[34]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[36]  Wanqing Li,et al.  Human detection from images and videos: A survey , 2016, Pattern Recognit..

[37]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Quoc V. Le,et al.  Multi-task Sequence to Sequence Learning , 2015, ICLR.

[41]  Mohan S. Kankanhalli,et al.  Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Andrea Vedaldi,et al.  Universal representations: The missing link between faces, text, planktons, and cat breeds , 2017, ArXiv.

[43]  Andrew Zisserman,et al.  Recurrent Human Pose Estimation , 2016, 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017).

[44]  Razvan Pascanu,et al.  Overcoming catastrophic forgetting in neural networks , 2016, Proceedings of the National Academy of Sciences.

[45]  Gaurav Sharma,et al.  AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Jiwon Kim,et al.  Continual Learning with Deep Generative Replay , 2017, NIPS.

[47]  Luc Van Gool,et al.  Deep Temporal Linear Encoding Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Tinne Tuytelaars,et al.  Expert Gate: Lifelong Learning with a Network of Experts , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Iasonas Kokkinos,et al.  UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision Using Diverse Datasets and Limited Memory , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Mehrtash Tafazzoli Harandi,et al.  Going deeper into action recognition: A survey , 2016, Image Vis. Comput..

[51]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  R. Sarpong,et al.  Bio-inspired synthesis of xishacorenes A, B, and C, and a new congener from fuscol† †Electronic supplementary information (ESI) available. See DOI: 10.1039/c9sc02572c , 2019, Chemical science.

[55]  Derek Hoiem,et al.  Learning without Forgetting , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[56]  Xi Chen,et al.  Learning From Demonstration in the Wild , 2018, 2019 International Conference on Robotics and Automation (ICRA).