TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks

Deep Learning (DL) models have achieved unprecedented success in a variety of tasks and are pervasively deployed in real-world applications. It is therefore critical to guarantee their correctness by testing their behavior. However, DL systems are notoriously difficult to test and debug due to their lack of explainability and the huge input space to cover. While it is usually easy to collect a massive amount of test data, labeling it can be prohibitively expensive. Consequently, it is essential to select and label only 'high-quality', bug-revealing test inputs, thereby reducing the testing cost. In this paper, we propose TestRank, a novel test prioritization technique that orders unlabeled test instances according to their bug-revealing capability. Unlike existing solutions, TestRank leverages both the intrinsic and the contextual attributes of test instances when prioritizing them. Specifically, we first build a similarity graph over the test instances and training samples, and apply graph-based semi-supervised learning to extract contextual features. Then, for each test instance, the contextual features extracted by a graph neural network (GNN) are combined with the intrinsic features obtained from the DL model itself to predict its bug-revealing probability. Finally, TestRank prioritizes the unlabeled test instances in descending order of this probability. We evaluate TestRank on a variety of image classification datasets. Experimental results show that it significantly outperforms existing test prioritization techniques in debugging efficiency.
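To make the pipeline concrete, below is a minimal PyTorch sketch of the ranking step described above. It is illustrative only: the function names (intrinsic_features, gcn_layer, rank_tests), the single graph-convolution layer, and the use of softmax confidence as the intrinsic feature are our assumptions for exposition, not the authors' implementation, which trains a full GNN on the similarity graph.

```python
# Hypothetical sketch of a TestRank-style ranking pipeline.
# All names and design choices here are illustrative assumptions.
import torch
import torch.nn.functional as F

def intrinsic_features(model, x):
    # Intrinsic attributes: the target DL model's own softmax output
    # (e.g., prediction confidence) for each test input.
    with torch.no_grad():
        return F.softmax(model(x), dim=1)          # shape (N, num_classes)

def gcn_layer(adj, feats, weight):
    # One graph-convolution step over the similarity graph:
    # D^{-1/2} (A + I) D^{-1/2} X W, in the style of Kipf & Welling's GCN.
    adj = adj + torch.eye(adj.size(0))             # add self-loops
    d_inv_sqrt = adj.sum(dim=1).pow(-0.5)
    norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    return torch.relu(norm_adj @ feats @ weight)

def rank_tests(contextual, intrinsic, scorer):
    # Concatenate contextual (GNN) and intrinsic features, predict each
    # test input's bug-revealing probability with a binary classifier
    # (assumed to output one logit per instance), and sort descending.
    combined = torch.cat([contextual, intrinsic], dim=1)
    prob_buggy = torch.sigmoid(scorer(combined)).squeeze(1)
    return torch.argsort(prob_buggy, descending=True)
```

In a full pipeline, the GNN and the scorer would be trained jointly on the small labeled subset (bug-revealing vs. benign), and the returned ordering would determine which unlabeled instances to label first.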
