Dealing with a large number of classes -- Likelihood, Discrimination or Ranking?

We consider training probabilistic classifiers when the number of classes is too large to perform exact normalisation over all classes. To address this, we consider a simple approach that directly approximates the likelihood. We show that this approach works well on toy problems and is competitive with recently introduced non-likelihood-based alternatives. Furthermore, we relate the approach to a simple ranking objective, which leads us to suggest a specific setting for the optimal threshold in that objective.
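The abstract does not spell out the sampling scheme, but the general idea of approximating the likelihood without exact normalisation can be illustrated with a minimal NumPy sketch: the softmax normaliser is estimated from a random subset of classes rather than all of them. The function name, the uniform proposal, and the importance weighting below are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

def sampled_log_likelihood(logits, target, num_samples, rng):
    """Approximate log p(target) = s_target - log Z without summing
    over all classes: the normaliser Z is estimated from a subset.

    Assumes a uniform proposal over classes; the paper's estimator
    may differ.
    """
    num_classes = logits.shape[0]
    # Draw a uniform subset of classes and drop the target if sampled,
    # so the target's own term is counted exactly once below.
    candidates = rng.choice(num_classes, size=num_samples, replace=False)
    candidates = candidates[candidates != target]
    # Each sampled class stands in for (num_classes - 1) / len(candidates)
    # of the remaining classes under the uniform proposal.
    log_weight = np.log(num_classes - 1) - np.log(len(candidates))
    # Numerically stable log-sum-exp over the target plus weighted samples.
    terms = np.concatenate(([logits[target]], logits[candidates] + log_weight))
    m = terms.max()
    log_z_hat = m + np.log(np.exp(terms - m).sum())
    return logits[target] - log_z_hat

rng = np.random.default_rng(0)
scores = rng.normal(size=100_000)  # unnormalised scores for 100k classes
print(sampled_log_likelihood(scores, target=42, num_samples=512, rng=rng))
```

The related ranking objective mentioned in the abstract could, for instance, take the form of a generic margin-based loss over sampled negative classes; the threshold here plays the role of the margin whose optimal setting the paper discusses. This is a hypothetical sketch, not the paper's exact objective.

```python
def ranking_loss(logits, target, negatives, threshold=1.0):
    # Hinge-style ranking: the target's score should exceed each sampled
    # negative class's score by at least `threshold`.
    margins = threshold - (logits[target] - logits[negatives])
    return np.maximum(0.0, margins).sum()
```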
