Minimax and Neyman–Pearson Meta-Learning for Outlier Languages

Model-agnostic meta-learning (MAML) has recently been put forth as a strategy to learn resource-poor languages in a sample-efficient fashion. Nevertheless, the properties of these languages are often not well represented by those available during training. Hence, we argue that the i.i.d. assumption ingrained in MAML makes it ill-suited for cross-lingual NLP. In fact, under a decision-theoretic framework, MAML can be interpreted as minimising the expected risk across training languages (with a uniform prior), which is known as the Bayes criterion. To increase its robustness to outlier languages, we create two variants of MAML based on alternative criteria: Minimax MAML reduces the maximum risk across languages, while Neyman–Pearson MAML constrains the risk in each language to a maximum threshold. Both criteria constitute fully differentiable two-player games. In light of this, we propose a new adaptive optimiser solving for a local approximation to their Nash equilibrium. We evaluate both model variants on two popular NLP tasks, part-of-speech tagging and question answering. We report gains in both average and minimum performance across low-resource languages in zero- and few-shot settings, compared to joint multi-source transfer and vanilla MAML. The code for our experiments is available at https://github.com/rahular/robust-maml.
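For concreteness, the three decision criteria contrasted in the abstract can be written out as follows. The notation here (θ for the model parameters, R_ℓ for the per-language meta-risk, b_ℓ for the Neyman–Pearson thresholds) is ours, a sketch of the decision-theoretic framing the abstract describes rather than the paper's exact formulation.

```latex
% Notation (assumed): \theta = model parameters, L = set of training
% languages, R_\ell(\theta) = meta-risk on language \ell, b_\ell = per-
% language risk threshold, \Delta = probability simplex over L.

% Bayes criterion (vanilla MAML; uniform prior over languages):
\min_{\theta} \; \frac{1}{|L|} \sum_{\ell \in L} R_\ell(\theta)

% Minimax criterion: protect the worst-off language. Equivalently a
% zero-sum two-player game against weights on the simplex:
\min_{\theta} \, \max_{\ell \in L} R_\ell(\theta)
\;=\;
\min_{\theta} \, \max_{w \in \Delta} \sum_{\ell \in L} w_\ell \, R_\ell(\theta)

% Neyman--Pearson criterion: cap each language's risk at b_\ell; the
% Lagrangian again yields a differentiable two-player game:
\min_{\theta} \, \max_{\lambda \ge 0} \sum_{\ell \in L}
\lambda_\ell \bigl( R_\ell(\theta) - b_\ell \bigr)
```

In both alternative criteria the adversary is a continuous player (simplex weights or Lagrange multipliers), which is what makes them fully differentiable two-player games amenable to the Nash-equilibrium solver the abstract mentions.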
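To make the two-player structure tangible, below is a minimal first-order sketch of one outer step of the minimax variant in PyTorch. It is an illustration under stated assumptions, not the paper's adaptive optimiser: the names (minimax_maml_step, inner_lr, weight_lr) are hypothetical, the inner loop takes a single first-order adaptation step, and the adversary is updated with a generic exponentiated-gradient (mirror-ascent) rule. It assumes PyTorch 2.x for torch.func.functional_call.

```python
import torch
from torch.func import functional_call


def minimax_maml_step(model, lang_batches, loss_fn, weights,
                      inner_lr=1e-2, weight_lr=1e-1):
    """One outer iteration of a minimax-style MAML game (sketch).

    lang_batches: one ((xs, ys), (xq, yq)) support/query pair per language.
    weights: current point on the probability simplex, one entry per language.
    Returns the weighted meta-loss (for the model player to descend)
    and the adversary's updated weights.
    """
    params = dict(model.named_parameters())
    meta_losses = []
    for (xs, ys), (xq, yq) in lang_batches:
        # Inner loop: one adaptation step on the support set
        # (first-order: the inner gradient itself is not differentiated).
        support_loss = loss_fn(functional_call(model, params, (xs,)), ys)
        grads = torch.autograd.grad(support_loss, tuple(params.values()))
        adapted = {name: p - inner_lr * g
                   for (name, p), g in zip(params.items(), grads)}
        # Meta-risk for this language: adapted parameters on the query set.
        meta_losses.append(
            loss_fn(functional_call(model, adapted, (xq,)), yq))

    losses = torch.stack(meta_losses)

    # Adversary's move: exponentiated-gradient (mirror) ascent on the
    # simplex, shifting mass toward the currently worst languages.
    with torch.no_grad():
        new_weights = weights * torch.exp(weight_lr * losses)
        new_weights = new_weights / new_weights.sum()

    # Model's move: descend the adversarially weighted meta-loss; the
    # caller then runs .backward() on it and steps the outer optimiser.
    return (new_weights * losses).sum(), new_weights
```

A Neyman–Pearson variant of the same sketch would swap the simplex normalisation for nonnegative Lagrange multipliers λ_ℓ, increased in proportion to the constraint violations R_ℓ(θ) − b_ℓ and clipped at zero.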
