Sampled Softmax with Random Fourier Features

The computational cost of training with the softmax cross-entropy loss grows linearly with the number of classes. When the number of classes is large, a common way to speed up training is to sample a subset of classes and estimate the loss gradient from these classes alone, a technique known as sampled softmax. However, sampled softmax yields a biased gradient estimate unless the classes are drawn from the exact softmax distribution, which is itself expensive to compute. A widely used practical approach is therefore to sample from a simpler distribution in the hope of approximating the exact softmax distribution. In this paper, we develop the first theoretical understanding of how the choice of sampling distribution determines the quality of sampled softmax. Motivated by this analysis and by prior work on kernel-based sampling, we propose the Random Fourier Softmax (RF-softmax) method, which uses the powerful Random Fourier Features to enable more efficient and accurate sampling from an approximate softmax distribution. We show that RF-softmax has low bias with respect to both the full softmax distribution and the full softmax gradient. Furthermore, the sampling cost of RF-softmax scales only logarithmically with the number of classes.
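To make the idea concrete, the following is a minimal NumPy sketch of sampled softmax with a Random-Fourier-Feature proposal distribution, written under several illustrative assumptions that are not taken from the paper: a Gaussian-kernel RFF map, unit-norm embeddings, a small floor applied to negative kernel scores, and a proposal computed explicitly over all classes for clarity (the paper instead relies on a kernel-based sampling structure, which is what gives the logarithmic dependence on the number of classes). All function names here are hypothetical.

# Minimal sketch: sampled softmax with an RFF-based proposal distribution.
# Assumptions (not from the paper): Gaussian-kernel RFF, unit-norm embeddings,
# negative kernel scores floored, proposal materialized over all classes.
import numpy as np

rng = np.random.default_rng(0)

def rff_map(x, W, b):
    # Random Fourier Features for the Gaussian kernel:
    # phi(x)^T phi(y) ~= exp(-||x - y||^2 / 2).
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

def rf_proposal(h, class_feats, W, b, floor=1e-6):
    # Proposal q_i proportional to phi(h)^T phi(c_i); negative scores are
    # floored to keep q a valid distribution (a simplification).
    scores = class_feats @ rff_map(h, W, b)          # shape: (num_classes,)
    scores = np.maximum(scores, floor)
    return scores / scores.sum()

def sampled_softmax_loss(h, C, y, q, m):
    # One common variant of the sampled softmax loss: sampled negatives get
    # the o_i - log(m * q_i) importance correction, the true class is kept as-is.
    neg = rng.choice(len(q), size=m, replace=True, p=q)
    neg = neg[neg != y]                              # drop the true class if sampled
    logits = np.concatenate(([C[y] @ h],
                             C[neg] @ h - np.log(m * q[neg])))
    logits -= logits.max()                           # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy usage: 10k classes, 64-dim embeddings, 128 random features, 100 samples.
n, d, D, m = 10_000, 64, 128, 100
C = rng.normal(size=(n, d)); C /= np.linalg.norm(C, axis=1, keepdims=True)
h = rng.normal(size=d); h /= np.linalg.norm(h)
W, b = rng.normal(size=(D, d)), rng.uniform(0, 2 * np.pi, size=D)
class_feats = np.sqrt(2.0 / D) * np.cos(C @ W.T + b)  # precompute phi(c_i)
q = rf_proposal(h, class_feats, W, b)
print(sampled_softmax_loss(h, C, 0, q, m))

The key design point the sketch illustrates is that the proposal only needs to track the exact softmax distribution well enough to keep the gradient estimate close to unbiased; because the RFF inner products approximate the exponentiated logits, classes with large logits are sampled more often without ever evaluating the full softmax.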
