Visualizing the PHATE of Neural Networks

Understanding why and how certain neural networks outperform others is key to guiding future development of network architectures and optimization methods. To this end, we introduce a novel visualization algorithm that reveals the internal geometry of such networks: Multislice PHATE (M-PHATE), the first method designed explicitly to visualize how a neural network's hidden representations of data evolve throughout the course of training. We demonstrate that our visualization provides intuitive, detailed summaries of the learning dynamics beyond simple global measures (e.g., validation loss and accuracy), without the need to access validation data. Furthermore, M-PHATE better captures both the dynamics and community structure of the hidden units than visualizations based on standard dimensionality reduction methods (e.g., ISOMAP, t-SNE). We demonstrate M-PHATE with two vignettes: continual learning and generalization. In the former, the M-PHATE visualizations display the mechanism of "catastrophic forgetting," a major challenge for learning in task-switching contexts. In the latter, our visualizations reveal how increased heterogeneity among hidden units correlates with improved generalization performance. An implementation of M-PHATE, along with scripts to reproduce the figures in this paper, is available at this https URL.
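
To make the workflow concrete, the sketch below illustrates the pipeline the abstract describes: record each hidden unit's activations on a fixed probe set at every training epoch, stack them into a (time steps, units, samples) tensor, and embed that tensor with M-PHATE. This is a minimal sketch, not the authors' exact scripts: the network architecture, probe size, and epoch count are illustrative, and it assumes the released m_phate package exposes an sklearn-style M_PHATE estimator whose fit_transform accepts a 3D array.

```python
import numpy as np
import tensorflow as tf
import m_phate  # assumed: the released M-PHATE package, sklearn-style API

# Small MLP on MNIST; the architecture is illustrative, not the paper's setup.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

inputs = tf.keras.Input(shape=(784,))
hidden = tf.keras.layers.Dense(128, activation="relu", name="hidden")(inputs)
outputs = tf.keras.layers.Dense(10, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Fixed probe set: the same inputs are fed through the network at every epoch,
# so each hidden unit is summarized by its activations on these samples.
probe = x_train[:100]
extractor = tf.keras.Model(inputs, hidden)

n_epochs = 20
trace = np.zeros((n_epochs, 128, len(probe)))  # (time steps, units, samples)
for epoch in range(n_epochs):
    model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)
    # Transpose so each row is one hidden unit's response across the probe set.
    trace[epoch] = extractor.predict(probe, verbose=0).T

# Assumed API: M_PHATE consumes the 3D activation tensor and returns a 2D
# embedding with one point per (epoch, hidden unit) pair.
embedding = m_phate.M_PHATE().fit_transform(trace)
print(embedding.shape)  # expected: (n_epochs * 128, 2)
```

Each point in the resulting embedding corresponds to one hidden unit at one epoch; coloring points by epoch (or connecting each unit's points across epochs) produces the trajectory-style visualizations of learning dynamics described above.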
