LILA: A Unified Benchmark for Mathematical Reasoning
Oyvind Tafjord | Ashish Sabharwal | Peter Clark | S. Welleck | Tanmay Rajpurohit | A. Kalyan | Swaroop Mishra | Chitta Baral | Matthew Finlayson | Pan Lu | Leonard Tang
[1] Swaroop Mishra,et al. How Many Data Samples is an Additional Instruction Worth? , 2022, FINDINGS.
[2] Matt Gardner,et al. QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension , 2021, ACM Comput. Surv.
[3] Song-Chun Zhu,et al. Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning , 2022, ICLR.
[4] Peter Clark,et al. NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks , 2022, ACL.
[5] Alexander M. Rush,et al. Multitask Prompted Training Enables Zero-Shot Task Generalization , 2021, ICLR.
[6] Yejin Choi,et al. Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics , 2021, AAAI.
[7] Yejin Choi,et al. Reframing Instructional Prompts to GPTk’s Language , 2021, FINDINGS.
[8] Quoc V. Le,et al. Finetuned Language Models Are Zero-Shot Learners , 2021, ICLR.
[9] Hannaneh Hajishirzi,et al. Cross-Task Generalization via Natural Language Crowdsourcing Instructions , 2021, ACL.
[10] Jesse Michael Han,et al. Proof Artifact Co-training for Theorem Proving with Language Models , 2021, ICLR.
[11] Mohammad Bavarian,et al. Training Verifiers to Solve Math Word Problems , 2021, ArXiv.
[12] Song-Chun Zhu,et al. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning , 2021, NeurIPS Datasets and Benchmarks.
[13] Song-Chun Zhu,et al. Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning , 2021, ACL.
[14] Lawrence C. Paulson,et al. IsarStep: a Benchmark for High-level Mathematical Reasoning , 2021, ICLR.
[15] Yejin Choi,et al. NaturalProofs: Mathematical Theorem Proving in Natural Language , 2021, NeurIPS Datasets and Benchmarks.
[16] Yejin Choi,et al. UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark , 2021, AAAI.
[17] Stella Biderman,et al. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow , 2021 .
[18] Navin Goyal,et al. Are NLP Models really able to Solve Simple Math Word Problems? , 2021, NAACL.
[19] Sonal Gupta,et al. Muppet: Massive Multi-task Representations with Pre-Finetuning , 2021, EMNLP.
[20] Jimmy Ba,et al. INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving , 2020, ICLR.
[21] Dan Roth,et al. Do Language Embeddings capture Scales? , 2020, BLACKBOXNLP.
[22] Keh-Yih Su,et al. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers , 2020, ACL.
[23] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[24] Sameer Singh,et al. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList , 2020, ACL.
[25] Bill Yuchen Lin,et al. Birds Have Four Legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models , 2020, EMNLP.
[26] Dawn Song,et al. Pretrained Transformers Improve Out-of-Distribution Robustness , 2020, ACL.
[27] Sameer Singh,et al. ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension , 2019, ArXiv.
[28] Dan Roth,et al. “Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding , 2019, EMNLP.
[29] Omer Levy,et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , 2019, NeurIPS.
[30] Gabriel Stanovsky,et al. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs , 2019, NAACL.
[31] Carolyn Penstein Rosé,et al. EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference , 2019, CoNLL.
[32] Peter Clark,et al. QuaRel: A Dataset and Models for Answering Questions about Qualitative Relationships , 2018, AAAI.
[33] K. Ramasubramanian,et al. Use of calculus in Hindu mathematics , 2019, Sources and Studies in the History of Mathematics and Physical Sciences.
[34] Graham Neubig,et al. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow , 2018, 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR).
[35] Leonidas J. Guibas,et al. Taskonomy: Disentangling Task Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[36] Dan Roth,et al. Mapping to Declarative Knowledge for Word Problem Solving , 2017, TACL.
[37] Wang Ling,et al. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems , 2017, ACL.
[38] Dan Roth,et al. Unit Dependency Graph and Its Application to Arithmetic Word Problem Solving , 2016, AAAI.
[39] Ming-Wei Chang,et al. Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems , 2016, EMNLP.
[40] Wei-Ying Ma,et al. How well do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation , 2016, ACL.
[41] Hannaneh Hajishirzi,et al. MAWPS: A Math Word Problem Repository , 2016, NAACL.
[42] Jason Weston,et al. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.
[43] Dan Roth,et al. Solving General Arithmetic Word Problems , 2016, EMNLP.
[44] Oren Etzioni,et al. Parsing Algebraic Word Problems into Equations , 2015, TACL.
[45] Ming-Wei Chang,et al. DRAW: A Challenging and Diverse Algebra Word Problem Set , 2015 .
[46] Dan Roth,et al. Reasoning about Quantities in Natural Language , 2015, TACL.
[47] Oren Etzioni,et al. Learning to Solve Arithmetic Word Problems with Verb Categorization , 2014, EMNLP.
[48] Luke S. Zettlemoyer,et al. Learning to Automatically Solve Algebra Word Problems , 2014, ACL.
[49] Benoît Sagot,et al. Crowdsourcing for Language Resource Development: Critical Analysis of Amazon Mechanical Turk Overpowering Use , 2011, LTC 2011.
[50] B. Sarkar. Hindu Achievements in Exact Science: A Study in the History of Scientific Development , 2006, Nature.