Affinitention nets: kernel perspective on attention architectures for set classification with applications to medical text and images

Set classification is the task of predicting a single label from a set comprising multiple instances. The examples we consider are pathology slides represented by sets of patches and medical text data represented by sets of word embeddings. State-of-the-art methods, such as the transformer network, typically use attention mechanisms to learn representations of set data by modeling interactions between instances of the set. These methods, however, have complex heuristic architectures comprising multiple heads and layers. The complexity of attention architectures hampers their training when only a small number of labeled sets are available, as is often the case in medical applications. To address this problem, we present a kernel-based representation learning framework that links learning affinity kernels to learning representations from attention architectures. We show that learning a combination of the sum and the product of kernels is equivalent to learning representations from multi-head multi-layer attention architectures. From our framework, we devise a simplified attention architecture which we term affinitention (affinity-attention) nets. We demonstrate the application of affinitention nets to classification of the Set-Cifar10 dataset, to thyroid malignancy prediction from pathology slides, and to patient text-message triage. We show that affinitention nets provide competitive results compared to heuristic attention architectures and outperform other competing methods.
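The correspondence stated in the abstract (a sum of affinity kernels playing the role of multiple attention heads, and a product of kernels playing the role of stacked attention layers) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's architecture: the choice of an RBF affinity kernel, the row normalization, the mean pooling of instances, and the names `rbf_kernel` and `affinity_attention` are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Affinity (RBF) kernel between all pairs of instances in a set X of shape (n, d)."""
    sq_dists = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

def row_normalize(K):
    """Normalize each row so the affinity matrix acts like an attention matrix."""
    return K / K.sum(axis=1, keepdims=True)

def affinity_attention(X, gammas=(0.5, 1.0, 2.0), n_layers=2):
    """Sum of kernels ~ multiple heads; repeated application (product) ~ stacked layers."""
    H = X
    for _ in range(n_layers):
        # "Multi-head" analogue: average a sum of row-normalized affinity kernels.
        A = np.mean([row_normalize(rbf_kernel(H, g)) for g in gammas], axis=0)
        # Applying A at every layer composes (multiplies) the kernels, the analogue of stacking layers.
        H = A @ H
    # Set-level representation: pool the instance representations for a downstream classifier.
    return H.mean(axis=0)

# Toy usage: a "set" of 16 instances, each a 32-dimensional embedding.
set_embedding = affinity_attention(np.random.randn(16, 32))
print(set_embedding.shape)  # (32,)
```

The point of the sketch is only the structural analogy: summing several kernels mixes different notions of affinity within one step (heads), while multiplying kernels across steps composes affinities over the set (layers), yielding an order-invariant set representation after pooling.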
