Test-Time Training on Video Streams

Prior work has established test-time training (TTT) as a general framework for further improving a trained model at test time. Before making a prediction on each test instance, the model is trained on that same instance using a self-supervised task, such as image reconstruction with masked autoencoders. We extend TTT to the streaming setting, where multiple test instances (video frames in our case) arrive in temporal order. Our extension is online TTT: the current model is initialized from the previous model, then trained on the current frame and a small window of frames immediately before it. Online TTT significantly outperforms the fixed-model baseline on four tasks across three real-world datasets. The relative improvement is 45% and 66% for instance and panoptic segmentation, respectively. Surprisingly, online TTT also outperforms its offline variant, which accesses more information by training on all frames from the entire test video regardless of temporal order. This differs from previous findings using synthetic videos. We conceptualize locality as the advantage of online over offline TTT. We analyze the role of locality with ablations and a theory based on the bias-variance trade-off.
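The online TTT loop described above can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single linear map `W` standing in for the model, a masked-reconstruction squared loss standing in for the masked-autoencoder task, and plain gradient descent. The names `online_ttt`, `masked_recon_grad`, `window`, `lr`, and `mask_ratio` are all hypothetical and chosen only for illustration.

```python
from collections import deque

import numpy as np


def masked_recon_grad(W, x, mask):
    """Gradient of 0.5 * ||W @ (mask * x) - x||^2 with respect to W."""
    x_in = mask * x            # hide the masked-out entries from the model
    err = W @ x_in - x         # reconstruction error on the full frame
    return np.outer(err, x_in)


def online_ttt(frames, W0, window=4, steps=1, lr=0.01, mask_ratio=0.5, seed=0):
    """Toy online TTT: carry the model across frames, train on a sliding window.

    Assumption: `frames` is an iterable of 1-D float arrays in temporal order,
    and `W0` is the pre-trained (fixed-model baseline) weight matrix.
    """
    rng = np.random.default_rng(seed)
    W = W0.copy()                       # never mutate the baseline weights
    recent = deque(maxlen=window + 1)   # current frame plus `window` earlier ones
    losses = []
    for x in frames:
        recent.append(x)
        # The model for this frame starts from the previous frame's model (W is
        # not reset), then takes self-supervised steps on the local window.
        for _ in range(steps):
            for f in recent:
                keep = (rng.random(f.shape) >= mask_ratio).astype(f.dtype)
                W -= lr * masked_recon_grad(W, f, keep)
        # A real system would now run the main-task prediction on frame x.
        # Here we only record the unmasked reconstruction loss as a stand-in.
        losses.append(0.5 * float(np.sum((W @ x - x) ** 2)))
    return W, losses
```

In this sketch, the offline variant mentioned in the abstract would correspond to training once on all frames before predicting on any of them, while the fixed-model baseline would use `W0` unchanged throughout. The bounded `deque` is what makes the training data local in time.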
