Leaf-Smoothed Hierarchical Softmax for Ordinal Prediction

We propose a new approach to conditional probability estimation for ordinal labels. First, we present a specialized hierarchical softmax variant inspired by k-d trees that leverages the inherent spatial structure of (potentially multivariate) ordinal labels. We then adapt ideas from signal processing on noisy graphs to develop a novel regularizer for such hierarchical softmax models. In a series of simulation studies, the tree structure and the regularizer each independently boost the sample efficiency of a deep learning model. Combining the two techniques yields additive gains, and the resulting model avoids the pathologies of other approaches in the literature. We validate our approach empirically on a suite of real-world datasets, in some cases reducing the error by nearly half compared to other popular methods. Our results demonstrate that our method is a powerful new technique for conditional probability estimation of ordinal labels, especially in the low-to-mid sample-size regimes often found in the biological and other physical sciences.
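
To make the construction concrete, below is a minimal PyTorch sketch (not the authors' implementation) of the two ingredients the abstract describes, under stated assumptions: a balanced binary tree over 2**depth one-dimensional ordinal bins, which is the 1-D special case of the k-d-tree construction, and a fused-lasso (order-0 trend filtering) penalty on adjacent leaf probabilities standing in for the graph-signal-processing regularizer. All names here (`HierSoftmaxHead`, `leaf_smoothing_penalty`, the 0.1 penalty weight) are illustrative, and the paper's exact tree-building rule and penalty may differ.

```python
# Sketch only: a binary-tree hierarchical softmax over ordinal bins plus a
# total-variation "leaf smoothing" penalty. Assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierSoftmaxHead(nn.Module):
    """Hierarchical softmax over 2**depth ordinal bins via a complete binary tree."""

    def __init__(self, in_dim: int, depth: int):
        super().__init__()
        self.depth = depth
        # One sigmoid decision per internal node (2**depth - 1 of them, heap-indexed).
        self.node_logits = nn.Linear(in_dim, 2 ** depth - 1)

    def leaf_log_probs(self, h: torch.Tensor) -> torch.Tensor:
        """Return log P(bin | h) for every leaf, shape (batch, 2**depth)."""
        logits = self.node_logits(h)                      # (batch, 2**depth - 1)
        batch = h.size(0)
        log_p = torch.zeros(batch, 1, device=h.device)    # log-prob of each partial path
        node = torch.zeros(batch, 1, dtype=torch.long, device=h.device)  # heap index
        for _ in range(self.depth):
            z = torch.gather(logits, 1, node)             # logit at each current node
            left = log_p + F.logsigmoid(-z)               # go left with prob sigmoid(-z)
            right = log_p + F.logsigmoid(z)               # go right with prob sigmoid(z)
            # Interleave children so leaves stay in left-to-right (ordinal) order.
            log_p = torch.stack([left, right], dim=2).reshape(batch, -1)
            node = torch.stack([2 * node + 1, 2 * node + 2], dim=2).reshape(batch, -1)
        return log_p


def leaf_smoothing_penalty(log_p: torch.Tensor) -> torch.Tensor:
    """Fused-lasso penalty on adjacent leaf probabilities (one plausible smoother)."""
    p = log_p.exp()
    return (p[:, 1:] - p[:, :-1]).abs().sum(dim=1).mean()
```

A usage sketch: train with the negative log-likelihood of the true bin plus the weighted smoothing penalty, so gradients flow through both the tree decisions and the smoother.

```python
head = HierSoftmaxHead(in_dim=64, depth=4)   # 16 ordinal bins
h = torch.randn(32, 64)                      # features from any encoder
y = torch.randint(0, 16, (32,))              # ordinal bin labels
log_p = head.leaf_log_probs(h)
loss = F.nll_loss(log_p, y) + 0.1 * leaf_smoothing_penalty(log_p)
loss.backward()
```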
