Response Time Analysis for Explainability of Visual Processing in CNNs

Explainable artificial intelligence (XAI) methods typically require access to model architecture and parameters, which is often not feasible for users, practitioners, and regulators. Inspired by cognitive psychology, we make a case for response times (RTs) as a technique for XAI: RTs are observable without any access to the model's internals. Moreover, dynamic inference models that perform conditional computation generate variable RTs on visual learning tasks, depending on the hierarchical representations they build. We show that MSDNet, a conditional computation model with an early-exit architecture, exhibits slower RTs for images with more complex features in the ObjectNet test set, and also displays the human phenomenon of scene grammar, in which object recognition depends on intra-scene object-object relationships. These results cast light on MSDNet's feature space without opening the black box and illustrate the promise of RT methods for XAI.
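To make the response-time proxy concrete, the sketch below shows one way to read an RT off an early-exit model: take the index of the first exit whose softmax confidence clears a threshold. This is a minimal illustration, not the paper's implementation; `ToyEarlyExitNet`, `response_time`, and the 0.9 confidence threshold are hypothetical stand-ins for an MSDNet-style network and its exit policy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEarlyExitNet(nn.Module):
    """Hypothetical stand-in for an early-exit model such as MSDNet:
    each convolutional block is followed by its own classifier head ("exit")."""
    def __init__(self, num_classes=10, num_exits=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(3 if i == 0 else 16, 16, 3, padding=1),
                           nn.ReLU()) for i in range(num_exits)]
        )
        self.exits = nn.ModuleList(
            [nn.Linear(16, num_classes) for _ in range(num_exits)]
        )

    def forward(self, x):
        logits_per_exit = []
        for block, head in zip(self.blocks, self.exits):
            x = block(x)
            pooled = x.mean(dim=(2, 3))  # global average pooling per exit
            logits_per_exit.append(head(pooled))
        return logits_per_exit


@torch.no_grad()
def response_time(model, image, threshold=0.9):
    """Return (exit_index, prediction): the first exit whose softmax
    confidence clears `threshold`. The exit index serves as an RT proxy,
    observable from the outside without inspecting weights."""
    logits_per_exit = model(image.unsqueeze(0))
    for k, logits in enumerate(logits_per_exit, start=1):
        conf, pred = F.softmax(logits, dim=1).max(dim=1)
        if conf.item() >= threshold or k == len(logits_per_exit):
            return k, pred.item()


model = ToyEarlyExitNet().eval()
rt, pred = response_time(model, torch.randn(3, 32, 32))
print(f"exited at stage {rt} with predicted class {pred}")
```

Under this scheme, images with more complex or incongruent features would tend to reach deeper exits, yielding larger RTs, which is the behavioral signal the abstract describes for ObjectNet and scene-grammar stimuli.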
