Quantifying Long Range Dependence in Language and User Behavior to improve RNNs

Characterizing temporal dependence patterns is a critical step in understanding the statistical properties of sequential data. Long Range Dependence (LRD), in which correlations decay as a power law of distance rather than exponentially, demands a different set of tools for modeling the underlying dynamics of sequential data. While it has been widely conjectured that LRD is present in language modeling and sequential recommendation, the amount of LRD in the corresponding sequential datasets has not yet been quantified in a scalable and model-independent manner. We propose a principled procedure for estimating LRD in sequential datasets, based on established LRD theory for real-valued time series, and apply it to sequences of symbols drawn from million-item-scale dictionaries. In our measurements, the procedure reliably estimates the LRD present in user behavior as users write Wikipedia articles and interact with YouTube. We further show that measuring LRD better informs modeling decisions, in particular for RNNs, whose ability to capture LRD remains an active area of research. The quantitative measure informs the design of new Evolutive Recurrent Neural Networks (EvolutiveRNNs), leading to state-of-the-art results on language understanding and sequential recommendation tasks at a fraction of the computational cost.
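The abstract does not spell out the estimator, but since it builds on established LRD theory for real-valued time series, a classical reference point is the log-periodogram regression of Geweke and Porter-Hudak (GPH). The sketch below is illustrative only: the function name, the bandwidth choice, and the idea of applying it to a real-valued proxy of a symbol sequence are assumptions for this example, not the authors' actual procedure.

```python
import numpy as np

def gph_memory_estimate(x, bandwidth_exponent=0.65):
    """Estimate the long-memory parameter d of a real-valued series x
    using GPH-style log-periodogram regression.

    LRD corresponds to d > 0 (equivalently, Hurst exponent H = d + 0.5 > 0.5);
    a short-memory series yields d close to 0.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()

    # Periodogram at the Fourier frequencies lambda_j = 2*pi*j/n, j = 1..n//2.
    freqs = 2.0 * np.pi * np.arange(1, n // 2 + 1) / n
    periodogram = np.abs(np.fft.rfft(x)[1:n // 2 + 1]) ** 2 / (2.0 * np.pi * n)

    # Keep only the m lowest frequencies; m = n**0.65 is a common bandwidth choice.
    m = int(n ** bandwidth_exponent)
    lam, I = freqs[:m], periodogram[:m]

    # Regress log I(lambda_j) on log(4 sin^2(lambda_j / 2)); the slope is -d.
    regressor = np.log(4.0 * np.sin(lam / 2.0) ** 2)
    slope, _ = np.polyfit(regressor, np.log(I), 1)
    d = -slope
    return d, d + 0.5  # (memory parameter d, Hurst exponent H)
```

Such an estimator operates on real-valued series, so applying it to clickstreams or text presumes some real-valued reduction of the symbol sequence (for instance, a per-item statistic); extending the estimation to large-vocabulary symbolic data in a model-independent way is precisely the contribution the abstract claims, and is not covered by this sketch.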
