Gesture and Action Discovery for Evaluating Virtual Environments with Semi-Supervised Segmentation of Telemetry Records

In this paper, we propose a novel pipeline for semi-supervised behavioral coding of videos of users testing a device or interface, with an eye toward human-computer interaction evaluation for virtual reality. Our system applies existing statistical techniques for time-series classification, including e-divisive change point detection and "Symbolic Aggregate approXimation" (SAX) with agglomerative hierarchical clustering, to 3D pose telemetry data. These techniques group short segments of single-person video data into classes of short actions of potential interest, which we call "micro-gestures." A long short-term memory (LSTM) layer then learns these micro-gestures from pose features generated purely from video by a pre-trained OpenPose convolutional neural network (CNN), and predicts their occurrence in unlabeled test videos. We present and discuss the results of testing our system on the single-user pose videos of the CMU Panoptic Dataset.
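
As a rough illustration of the SAX and agglomerative clustering step named above, the following is a minimal sketch and not the pipeline's actual implementation. It assumes the telemetry has already been cut into one-dimensional segments (the change point detection step is not shown). The function names `sax_word`, `word_distance`, and `agglomerate`, the four-letter alphabet, the word length, and the toy segments are all hypothetical choices made for this example. The distance used here is a simple symbol-index difference standing in for the MINDIST measure from the SAX literature.

```python
from bisect import bisect_left
from statistics import NormalDist, mean, pstdev


def sax_word(series, n_segments=4, alphabet="abcd"):
    """Turn one numeric segment into a SAX word.

    Assumes len(series) >= n_segments so every PAA chunk is non-empty.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a flat segment has no shape, so map it to all zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise Aggregate Approximation: average each of n_segments chunks.
    n = len(z)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    paa = [mean(z[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]
    # Breakpoints split the standard normal into equiprobable regions.
    k = len(alphabet)
    cuts = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    return "".join(alphabet[bisect_left(cuts, v)] for v in paa)


def word_distance(a, b, alphabet="abcd"):
    """Simplified distance: summed gap between symbol indices (not MINDIST)."""
    return sum(abs(alphabet.index(x) - alphabet.index(y)) for x, y in zip(a, b))


def agglomerate(words, n_clusters, alphabet="abcd"):
    """Naive average-linkage agglomerative clustering over SAX words.

    Returns a list of clusters, each a list of indices into `words`.
    """
    clusters = [[i] for i in range(len(words))]

    def linkage(c1, c2):
        return mean(
            word_distance(words[i], words[j], alphabet) for i in c1 for j in c2
        )

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        pairs = (
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] += clusters.pop(b)
    return clusters


# Toy segments: two rising and two falling traces of a single pose coordinate.
segments = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [1, 2, 3, 4, 5, 6, 7, 9],
    [7, 6, 5, 4, 3, 2, 1, 0],
    [9, 7, 6, 5, 4, 3, 2, 1],
]
words = [sax_word(s) for s in segments]
clusters = agglomerate(words, n_clusters=2)
print(words, clusters)
```

In this toy setting the rising and falling traces receive different SAX words and end up in separate clusters, which is the sense in which clusters of symbolic words can serve as candidate micro-gesture classes. A real pipeline would work on multivariate pose data and would more likely rely on an optimized clustering library than on this quadratic loop.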
