Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning

We use Bayesian optimization to learn curricula for word representation learning, optimizing performance on downstream tasks that use the learned representations as features. Each curriculum is modeled by a linear ranking function: the scalar product of a learned weight vector and an engineered feature vector that characterizes different aspects of the complexity of each instance in the training corpus. We show that learning the curriculum improves performance on a variety of downstream tasks compared with both random orderings and the natural corpus order.
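
The linear ranking function can be illustrated with a minimal sketch. The feature values, the weights, and the function name `order_by_curriculum` are hypothetical and chosen only for illustration. Sorting in ascending score order is also an assumption, since the abstract does not state the sort direction. In the approach described here the weight vector would be proposed by Bayesian optimization rather than fixed by hand.

```python
import numpy as np

def order_by_curriculum(features, weights):
    """Return instance indices sorted by the linear score w . phi(x).

    features: (n_instances, n_features) array of complexity features.
    weights:  (n_features,) weight vector.
    """
    scores = features @ weights
    # A stable sort keeps the natural corpus order among tied scores.
    return np.argsort(scores, kind="stable")

# Hypothetical complexity features: one row per training instance.
features = np.array([
    [0.9, 0.2],
    [0.1, 0.4],
    [0.5, 0.5],
])
# Hypothetical weights; in practice these would be the quantity being learned.
weights = np.array([1.0, -0.5])

order = order_by_curriculum(features, weights)
print(order.tolist())
```

An outer loop would then train word representations on the corpus in this order, evaluate them on the downstream task, and feed that score back to the optimizer to propose the next weight vector.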
