Is there progress in activity progress prediction?

Activity progress prediction aims to estimate what percentage of an activity has been completed. Currently, this is done with machine-learning models trained and evaluated on complex, realistic video datasets. The videos in these datasets vary drastically in length and appearance, and some of the activities contain unanticipated developments, making activity progression difficult to estimate. In this work, we examine the results obtained by existing progress prediction methods on these datasets. We find that current progress prediction methods do not seem to extract visual information that is useful for the progress prediction task, and therefore fail to exceed simple frame-counting baselines. We design a precisely controlled synthetic dataset for activity progress prediction and show that, on this dataset, the considered methods can make use of visual information when it directly relates to progress. We conclude that the progress prediction task is ill-posed on the currently used real-world datasets. Moreover, to fairly measure activity progression, we advise comparing against a simple but effective frame-counting baseline.
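A frame-counting baseline of the kind recommended above can be sketched in a few lines. This is a hedged illustration, not necessarily the exact baseline used in the paper: it predicts progress purely from the frame index and the average training-set video length, ignoring all visual content; the function name and parameters are assumptions introduced for this example.

```python
def frame_counting_baseline(frame_index: int, mean_train_length: float) -> float:
    """Predict activity progress in [0, 1] from the frame index alone.

    frame_index: 0-based index of the current frame.
    mean_train_length: average video length (in frames) over the training set,
                       used as a stand-in for the unknown test-video length.
    """
    # Linearly map the elapsed frame count to progress, capped at 100%.
    return min((frame_index + 1) / mean_train_length, 1.0)


# Example: with an average training length of 100 frames,
# frame index 49 (the 50th frame) is predicted to be at 50% progress.
print(frame_counting_baseline(49, 100))  # 0.5
```

Because this baseline never looks at the pixels, any learned model that fails to beat it has arguably not extracted progress-relevant visual information.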
