Paper information - Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language


Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
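The retrieval setting described in the abstract can be illustrated with a small sketch: each text query and each motion is mapped to a vector in a shared space, motions are ranked by similarity to the query, and the ranking is scored. The abstract does not name the similarity function or the metrics, so cosine similarity and recall@k are assumptions here, chosen because they are common in cross-modal retrieval. The embeddings are made up, and the function names (`cosine`, `rank_motions`, `recall_at_k`) are hypothetical and not taken from the authors' repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_motions(text_vec, motion_vecs):
    """Return motion indices sorted from most to least similar to the text."""
    scores = [cosine(text_vec, m) for m in motion_vecs]
    return sorted(range(len(motion_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_vecs, motion_vecs, k):
    """Fraction of queries whose paired motion (same index) is in the top k."""
    hits = 0
    for i, text_vec in enumerate(text_vecs):
        if i in rank_motions(text_vec, motion_vecs)[:k]:
            hits += 1
    return hits / len(text_vecs)


# Toy, made-up embeddings (assumption): text i is paired with motion i.
# In a real system these would come from a text encoder and a motion encoder.
text_vecs = [[1.0, 0.0], [0.6, 0.8], [1.0, 1.0]]
motion_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.6]]

if __name__ == "__main__":
    print("ranking for query 0:", rank_motions(text_vecs[0], motion_vecs))
    print("recall@1:", recall_at_k(text_vecs, motion_vecs, 1))
    print("recall@2:", recall_at_k(text_vecs, motion_vecs, 2))
```

In this toy data the second query ranks its paired motion second rather than first, so recall@1 is lower than recall@2. A metric-learning loss of the kind the abstract mentions would be trained to push such paired items to the top of the ranking.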

F. Falchi | Nicola Messina | J. Sedmidubský | Tomáš Rebok
