JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection

The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognize human activities in an unconstrained real-world environment, with potentially highly unbalanced and long-tailed data, remains a significant challenge, not least owing to the lack of a representative large-scale dataset. Most existing large-scale datasets are collected either from a specific or constrained environment, e.g., kitchens or rooms, or from video-sharing platforms such as YouTube. In this paper, we introduce JRDB-Act, a multi-modal dataset and an extension of the existing JRDB, which is captured by a social mobile manipulator and reflects a real distribution of daily human actions in a university campus environment. JRDB-Act is densely annotated with atomic actions and comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labeled with one pose-based action label and multiple (optional) interaction-based action labels. Moreover, JRDB-Act comes with social group identification annotations, conducive to the task of grouping individuals based on their interactions in the scene in order to infer their social activities (the common activities in each social group).

1. The JRDB-Act Dataset

The multi-modal JRDB dataset [4, 5] is composed of 64 minutes of sensory data obtained from the mobile JackRabbot robot, containing 54 sequences of indoor and outdoor scenes in a university campus environment and covering different human poses, behaviors, and social interactions. JRDB provides the following ground-truth labels: 1) over 2.4 million 2D bounding boxes for all persons visible in the five stereo RGB cameras, which capture a panoramic cylindrical 360° image view, 2) over 1.8 million 3D oriented bounding boxes in the point clouds captured by the two 16-channel LiDAR sensors, 3) the association of every 3D bounding box with its corresponding 2D box, and 4) the track ID of all 2D and 3D boxes over time. We build JRDB-Act by providing additional individual human action labels and social grouping annotations on top of the existing JRDB. These annotations make JRDB-Act the only available multi-modal dataset for learning multiple tasks such as human detection, tracking, social group formation, individual action detection, and social activity recognition. In this section, we elaborate on different aspects of the dataset.

A. Action Vocabulary: Since JRDB is collected in a campus environment, our action vocabulary consists of common daily human actions. After carefully examining the dataset, we arrived at 11 pose-based action classes: Walking, Standing, Sitting, Cycling, Going upstairs, Bending, Going downstairs, Skating, Scootering, Running, Lying; 3 human-human interaction classes: Talking to someone, Listening to someone, Greeting gestures; and 12 human-object interaction classes: Holding something, Looking into something (e.g. monitor, TV, tablet, etc.), Looking at robot, Looking at something (e.g. poster), Typing, Interaction with door, Talking on the phone, Eating something, Reading, Writing, Pointing at something, Pushing.
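For reference, the vocabulary above can be transcribed as plain Python lists. This is only a convenience sketch built from the class names listed in the text; the exact label strings and ordering used in the released annotation files may differ.

```python
# Action vocabulary of JRDB-Act as listed above. The exact label strings and
# ordering in the released annotation files may differ; this is a plain
# transcription of the classes named in the text.

POSE_ACTIONS = [
    "walking", "standing", "sitting", "cycling", "going upstairs", "bending",
    "going downstairs", "skating", "scootering", "running", "lying",
]

HUMAN_HUMAN_INTERACTIONS = [
    "talking to someone", "listening to someone", "greeting gestures",
]

HUMAN_OBJECT_INTERACTIONS = [
    "holding something", "looking into something (e.g. monitor, TV, tablet)",
    "looking at robot", "looking at something (e.g. poster)", "typing",
    "interaction with door", "talking on the phone", "eating something",
    "reading", "writing", "pointing at something", "pushing",
]

# Sanity checks against the counts stated in the text: 11 + 3 + 12 classes.
assert len(POSE_ACTIONS) == 11
assert len(HUMAN_HUMAN_INTERACTIONS) == 3
assert len(HUMAN_OBJECT_INTERACTIONS) == 12
```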
B. Action Annotation: Action annotation is provided densely, per frame and per box, for both the LiDAR and video sequences; however, only the panoramic videos were used to annotate the action labels. During the annotation process, we utilized the 2D bounding boxes and track IDs annotated in JRDB; for each bounding box, one (mandatory) pose-based action label and any number of (optional) interaction-based action labels (human-human, human-object) were selected from the action vocabulary above. If none of the action labels in the list was descriptive of a bounding box, annotators were able to flag the box accordingly.
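To make these labelling rules concrete, the sketch below models a single annotated bounding box under the constraints just described: exactly one pose-based label, any number of interaction labels, plus the JRDB track ID and a social group ID. The field names and record layout are purely illustrative assumptions, not the schema of the released annotation files; the validity check reuses the vocabulary lists from the previous sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BoxAnnotation:
    """Hypothetical per-box record; field names are illustrative only."""
    track_id: int                                  # JRDB track ID of this person
    box_2d: Tuple[float, float, float, float]      # (x, y, w, h) in the panoramic view
    pose_action: str                               # exactly one pose-based label (mandatory)
    interaction_actions: List[str] = field(default_factory=list)  # zero or more
    social_group_id: int = -1                      # group membership; -1 if unannotated


def is_valid(ann: BoxAnnotation) -> bool:
    """Check the labelling constraints stated in the text: one pose-based
    label, and interaction labels drawn from the interaction vocabulary."""
    has_one_pose = ann.pose_action in POSE_ACTIONS
    interactions_ok = all(
        a in HUMAN_HUMAN_INTERACTIONS or a in HUMAN_OBJECT_INTERACTIONS
        for a in ann.interaction_actions
    )
    return has_one_pose and interactions_ok


# Example: a standing person talking to someone while holding something.
example = BoxAnnotation(
    track_id=17,
    box_2d=(1024.0, 220.0, 60.0, 180.0),
    pose_action="standing",
    interaction_actions=["talking to someone", "holding something"],
    social_group_id=3,
)
assert is_valid(example)
```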

References

[1] Silvio Savarese et al. JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[2] Silvio Savarese et al. JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.

[3] Ian Reid et al. Joint Learning of Social Groups, Individuals Action and Sub-group Activities in Videos. ECCV, 2020.

[4] Cordelia Schmid et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[5] Ali Farhadi et al. Unsupervised Deep Embedding for Clustering Analysis. ICML, 2015.

[6] Luc Van Gool et al. The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 2014.