Uncertainty-sensitive Activity Recognition: A Reliability Benchmark and the CARING Models

Beyond assigning the correct class, an activity recognition model should also be able to determine how certain it is in its predictions. We present the first study of how well the confidence values of modern action recognition architectures reflect the probability of the correct outcome, and propose a learning-based approach for improving this. First, we extend two popular action recognition datasets with a reliability benchmark in the form of the expected calibration error and reliability diagrams. Since our evaluation shows that the confidence values of standard action recognition architectures do not represent the uncertainty well, we introduce a new approach which learns to transform the model output into realistic confidence estimates through an additional calibration network. The main idea of our Calibrated Action Recognition with Input Guidance (CARING) model is to learn an optimal scaling parameter depending on the video representation. We compare our model with the native action recognition networks and with temperature scaling, a widespread calibration method used in image classification. While temperature scaling alone drastically improves the reliability of the confidence values, our CARING method consistently leads to the best uncertainty estimates in all benchmark settings.
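
To make the two quantities named in the abstract concrete, the following is a minimal sketch of the expected calibration error (with equal-width confidence bins) and of temperature scaling, written with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `expected_calibration_error`, the choice of 10 bins, and the toy inputs are all hypothetical. Passing a per-sample array of temperatures only hints at the idea of an input-dependent scaling parameter; in CARING that parameter is learned by a calibration network from the video representation, which is not reproduced here.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Softmax over the last axis after dividing the logits by a temperature.

    `temperature` may be a scalar (standard temperature scaling) or an array
    of shape (N, 1) holding one value per sample (illustrative assumption).
    """
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)       # confidence of the predicted class
    predictions = probs.argmax(axis=1)    # predicted class index
    correct = (predictions == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by the fraction of samples
    return ece


if __name__ == "__main__":
    # Toy logits and labels, invented for illustration only.
    logits = np.array([[4.0, 0.0, 0.0], [3.0, 2.5, 0.0], [0.0, 5.0, 1.0]])
    labels = np.array([0, 1, 1])
    for t in (1.0, 2.0):
        ece = expected_calibration_error(softmax(logits, t), labels)
        print(f"temperature={t}: ECE={ece:.3f}")
```

In practice the temperature would be fitted on a held-out validation set rather than chosen by hand; a temperature above 1 softens the distribution and lowers the confidence of the predicted class without changing which class is predicted.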
