Deep Neural Networks Evolve Human-like Attention Distribution during Reading Comprehension

Attention is a key mechanism for information selection in both biological brains and many state-of-the-art deep neural networks (DNNs). Here, we investigate whether humans and DNNs allocate attention in comparable ways when reading a text passage to subsequently answer a specific question. We analyze 3 transformer-based DNNs that reach human-level performance when trained to perform the reading comprehension task. We find that the DNN attention distribution quantitatively resembles human attention distribution measured by fixation times. Human readers fixate longer on words that are more relevant to the question-answering task, demonstrating that attention is modulated by top-down reading goals, on top of lower-level visual and text features of the stimulus. Further analyses reveal that the attention weights in DNNs are also influenced by both top-down reading goals and lower-level stimulus features, with the shallow layers more strongly influenced by lower-level text features and the deep layers attending more to task-relevant words. Additionally, deep layers' attention to task-relevant words gradually emerges when pre-trained DNN models are fine-tuned to perform the reading comprehension task, which coincides with the improvement in task performance. These results demonstrate that DNNs can evolve human-like attention distribution through task optimization, which suggests that human attention during goal-directed reading comprehension is a consequence of task optimization.

[1]  Hai Zhao,et al.  Dual Co-Matching Network for Multi-choice Reading Comprehension , 2020, AAAI.

[2]  Emmanuel Dupoux,et al.  Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies , 2016, TACL.

[3]  Daniel L. K. Yamins,et al.  A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy , 2018, Neuron.

[4]  Bowen Zhou,et al.  A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[5]  M. Corbetta,et al.  Control of goal-directed and stimulus-driven attention in the brain , 2002, Nature Reviews Neuroscience.

[6]  Chenxi Liu,et al.  Attention Correctness in Neural Image Captioning , 2016, AAAI.

[7]  Fedor Moiseev,et al.  Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned , 2019, ACL.

[8]  Frédo Durand,et al.  What Do Different Evaluation Metrics Tell Us About Saliency Models? , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  S. Pinker,et al.  On language and connectionism: Analysis of a parallel distributed processing model of language acquisition , 1988, Cognition.

[10]  Ali Borji,et al.  Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study , 2013, IEEE Transactions on Image Processing.

[11]  Thomas Wolf,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[12]  Nai Ding,et al.  PALRACE: Reading Comprehension Dataset with Human Data and Labeled Rationales , 2021, ArXiv.

[13]  James L. McClelland,et al.  The TRACE model of speech perception , 1986, Cognitive Psychology.

[14]  John Hale,et al.  Finding syntax in human encephalography with beam search , 2018, ACL.

[15]  Rajesh P. N. Rao,et al.  Bayesian inference and attentional modulation in the visual cortex , 2005, Neuroreport.

[16]  Peng Li,et al.  Option Comparison Network for Multiple-choice Reading Comprehension , 2019, ArXiv.

[17]  Omer Levy,et al.  What Does BERT Look at? An Analysis of BERT’s Attention , 2019, BlackboxNLP@ACL.

[18]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[19]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[20]  D. Ballard,et al.  Eye guidance in natural vision: reinterpreting salience. , 2011, Journal of vision.

[21]  G. McConkie,et al.  What guides a reader's eye movements? , 1976, Vision Research.

[22]  Diyi Yang,et al.  Hierarchical Attention Networks for Document Classification , 2016, NAACL.

[23]  J. Wolfe,et al.  Five factors that guide attention in visual search , 2017, Nature Human Behaviour.

[24]  Peter Dayan,et al.  Inference, Attention, and Decision in a Bayesian Neural Architecture , 2004, NIPS.

[25]  Ha Hong,et al.  Performance-optimized hierarchical models predict neural responses in higher visual cortex , 2014, Proceedings of the National Academy of Sciences.

[26]  Weiming Zhang,et al.  Neural Machine Reading Comprehension: Methods and Trends , 2019, Applied Sciences.

[27]  知秀 柴田 5分で分かる!? 有名論文ナナメ読み:Jacob Devlin et al. : BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding , 2020 .

[28]  Omer Levy,et al.  RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.

[29]  S. Kakade,et al.  Learning and selective attention , 2000, Nature Neuroscience.

[30]  Tasha Nagamine,et al.  Exploring how deep neural networks form phonemic categories , 2015, INTERSPEECH.

[31]  Guokun Lai,et al.  RACE: Large-scale ReAding Comprehension Dataset From Examinations , 2017, EMNLP.

[32]  Frédo Durand,et al.  A Benchmark of Computational Models of Saliency to Predict Human Fixations , 2012 .

[33]  Kevin Gimpel,et al.  ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.

[34]  W. Bruce Croft,et al.  Do People and Neural Nets Pay Attention to the Same Words: Studying Eye-tracking Data for Non-factoid QA Evaluation , 2020, CIKM.

[35]  D H Brainard,et al.  The Psychophysics Toolbox. , 1997, Spatial vision.

[36]  Matthias Bethge,et al.  Generalisation in humans and deep neural networks , 2018, NeurIPS.

[37]  Bolei Zhou,et al.  Object Detectors Emerge in Deep Scene CNNs , 2014, ICLR.

[38]  M. Posner,et al.  The attention system of the human brain: 20 years after. , 2012, Annual review of neuroscience.

[39]  Xiaodong Cui,et al.  English Conversational Telephone Speech Recognition by Humans and Machines , 2017, INTERSPEECH.

[40]  Anna Rumshisky,et al.  Revealing the Dark Secrets of BERT , 2019, EMNLP.

[41]  James L. McClelland,et al.  A distributed, developmental model of word recognition and naming. , 1989, Psychological review.

[42]  Dhruv Batra,et al.  Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.

[43]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[44]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Jeremy H. Clear,et al.  The British national corpus , 1993 .

[46]  Robert Frank,et al.  Open Sesame: Getting inside BERT’s Linguistic Knowledge , 2019, BlackboxNLP@ACL.

[47]  Erik D. Reichle,et al.  The E-Z Reader model of eye-movement control in reading: Comparisons to other models , 2003, Behavioral and Brain Sciences.

[48]  Benoît Sagot,et al.  What Does BERT Learn about the Structure of Language? , 2019, ACL.

[49]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[50]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.