Thinking Like a Developer? Comparing the Attention of Humans with Neural Models of Code

Neural models of code are successfully tackling various prediction tasks, complementing and sometimes even outperforming traditional program analyses. While most work focuses on end-to-end evaluations of such models, it often remains unclear what the models actually learn and to what extent their reasoning about code matches that of skilled humans. A poor understanding of a model's reasoning risks deploying models that are right for the wrong reason and making decisions based on spurious correlations in the training dataset. This paper investigates to what extent the attention weights of effective neural models match the reasoning of skilled humans. To this end, we present a methodology for recording human attention and use it to gather 1,508 human attention maps from 91 participants, the largest such dataset we are aware of. Computing human-model correlations shows that the copy attention of neural models often matches the way humans reason about code (Spearman rank coefficients of 0.49 and 0.47), which provides an empirical justification for the intuition behind copy attention. In contrast, the regular attention of the models is mostly uncorrelated with human attention. We find that models and humans sometimes focus on different kinds of tokens, e.g., strings are important to humans but mostly ignored by the models. The results also show that human-model agreement positively correlates with accurate predictions by a model, which calls for neural models that mimic human reasoning even more closely. Beyond the insights from our study, we envision that releasing our dataset of human attention maps will help understand future neural models of code and foster work on human-inspired models.
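To make the core measurement concrete, the following minimal Python sketch (not the paper's actual analysis code; the per-token attention values are hypothetical) shows how a Spearman rank coefficient between a human attention map and a model attention map over the same code tokens can be computed with SciPy:

    from scipy.stats import spearmanr

    # Hypothetical per-token attention weights over the same code snippet,
    # each map normalized to sum to 1.
    human_attention = [0.40, 0.05, 0.02, 0.35, 0.18]  # e.g., recorded human focus
    model_attention = [0.30, 0.10, 0.05, 0.40, 0.15]  # e.g., copy-attention weights

    # Spearman's rho compares the rankings of the tokens, so it is
    # insensitive to how the two maps are scaled.
    rho, p_value = spearmanr(human_attention, model_attention)
    print(f"Spearman rank coefficient: {rho:.2f} (p = {p_value:.3f})")

A high rho indicates that the model and the human rank the importance of the same tokens similarly; this rank-based notion of agreement is what the coefficients of 0.49 and 0.47 above refer to.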