Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification

Prominently used in support vector machines and logistic regression, kernel functions (kernels) can implicitly map data points into high-dimensional spaces and make it easier to learn complex decision boundaries. In this work, we explore the use of kernels for contextual word classification by replacing the inner product in the softmax layer with a kernel function. To compare individual kernels, we conduct experiments on standard language modeling and machine translation tasks. We observe a wide range of performance across different kernel settings. Extending these results, we examine gradient properties, investigate various mixture strategies, and study the disambiguation capabilities of the kernels.
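The idea of swapping the inner product for a kernel in the output layer can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the class name `KernelSoftmaxOutput`, the `gamma` hyperparameter, and the specific choice of dot-product, Gaussian (RBF), and polynomial kernels are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's code): a softmax output layer whose
# logits are kernel scores k(h, w_c) between the context vector h and each
# class embedding w_c, instead of the usual dot product h . w_c.
import torch
import torch.nn as nn


class KernelSoftmaxOutput(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int,
                 kernel: str = "rbf", gamma: float = 1.0):
        super().__init__()
        self.class_emb = nn.Parameter(torch.randn(vocab_size, hidden_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        self.kernel = kernel
        self.gamma = gamma  # bandwidth for the RBF kernel (illustrative value)

    def logits(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim); class_emb: (vocab_size, hidden_dim)
        if self.kernel == "dot":
            # Ordinary softmax layer: plain inner product.
            scores = h @ self.class_emb.t()
        elif self.kernel == "rbf":
            # Log of the Gaussian kernel exp(-gamma * ||h - w_c||^2),
            # used directly as the logit.
            dist2 = torch.cdist(h, self.class_emb).pow(2)
            scores = -self.gamma * dist2
        elif self.kernel == "poly":
            # Degree-2 polynomial kernel (assumed form).
            scores = (h @ self.class_emb.t() + 1.0).pow(2)
        else:
            raise ValueError(f"unknown kernel: {self.kernel}")
        return scores + self.bias

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Class probabilities via softmax over the kernel scores.
        return torch.softmax(self.logits(h), dim=-1)


if __name__ == "__main__":
    layer = KernelSoftmaxOutput(hidden_dim=16, vocab_size=100, kernel="rbf")
    context = torch.randn(4, 16)            # e.g. RNN/Transformer hidden states
    probs = layer(context)
    print(probs.shape, probs.sum(dim=-1))    # (4, 100), each row sums to 1
```

Switching the `kernel` argument (or mixing several scores before the softmax) is one way to realize the kernel comparisons and mixture strategies discussed in the abstract.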
