Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Thanks to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite the promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and established spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
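
As a rough illustration of the retrieval task described above, the following minimal sketch ranks motions by cosine similarity to a text query in a shared embedding space and computes recall@k. It is not the implementation from the linked repository: the function names (`cosine`, `rank_motions`, `recall_at_k`) are hypothetical, the toy vectors stand in for the outputs of the text and motion encoders, and recall@k is assumed here as a common cross-modal retrieval metric rather than taken from the evaluation protocol of the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_motions(text_emb, motion_embs):
    """Return motion indices sorted from most to least similar to the text."""
    scores = [cosine(text_emb, m) for m in motion_embs]
    return sorted(range(len(motion_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(text_embs, motion_embs, k):
    """Fraction of queries whose paired motion (same index) is in the top k."""
    hits = 0
    for i, text_emb in enumerate(text_embs):
        if i in rank_motions(text_emb, motion_embs)[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings (assumption): text_embs[i] describes motion_embs[i].
motion_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 0.2, 0.9]]

print(rank_motions(text_embs[1], motion_embs))
print(recall_at_k(text_embs, motion_embs, k=1))
print(recall_at_k(text_embs, motion_embs, k=2))
```

In this toy setup the second description is slightly closer to the first motion than to its own, so it counts as a miss at k=1 and a hit at k=2.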
