Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Dan Hendrycks | Scott Emmons | Steven Basart | Hanlin Zhang | Andy Zou | Alexander Pan | Chan Jun Shern | Nathaniel Li | Thomas Woodside | Jonathan Ng
[1] Chenfei Wu, et al. TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs, 2023, Intelligent Computing.
[2] Dan Hendrycks. Natural Selection Favors AIs over Humans, 2023, ArXiv.
[3] Marco Tulio Ribeiro, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4, 2023, ArXiv.
[4] Luke Zettlemoyer, et al. Toolformer: Language Models Can Teach Themselves to Use Tools, 2023, NeurIPS.
[5] Jungo Kasai, et al. Batch Prompting: Efficient Inference with Large Language Model APIs, 2023, ArXiv.
[6] Alexander H. Miller, et al. Human-level play in the game of Diplomacy by combining language models with strategic reasoning, 2022, Science.
[7] Christopher D. Manning, et al. Holistic Evaluation of Language Models, 2023, Annals of the New York Academy of Sciences.
[8] Lisa Anne Hendricks, et al. Taxonomy of Risks posed by Language Models, 2022, FAccT.
[9] Joseph Carlsmith. Is Power-Seeking AI an Existential Risk?, 2022, ArXiv.
[10] Dan Hendrycks, et al. X-Risk Analysis for AI Research, 2022, ArXiv.
[11] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.
[12] Sergio Gomez Colmenarejo, et al. A Generalist Agent, 2022, Trans. Mach. Learn. Res.
[13] Yejin Choi, et al. Aligning to Social Norms and Values in Interactive Narratives, 2022, NAACL.
[14] S. Levine, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, 2022, CoRL.
[15] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models, 2022, NeurIPS.
[16] D. Song, et al. What Would Jiminy Cricket Do? Towards Agents That Behave Morally, 2021, NeurIPS Datasets and Benchmarks.
[17] Nicholas Carlini, et al. Unsolved Problems in ML Safety, 2021, ArXiv.
[18] Owain Evans, et al. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2021, ACL.
[19] Ashutosh Modi, et al. Pre-trained Language Models as Prior Knowledge for Playing Text-based Games, 2021, AAMAS.
[20] Mark O. Riedl, et al. Learning Knowledge Graph-based World Models of Textual Environments, 2021, NeurIPS.
[21] Brent Harrison, et al. Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior, 2021, ArXiv.
[22] Liu Yang, et al. Long Range Arena: A Benchmark for Efficient Transformers, 2020, ICLR.
[23] Matthew J. Hausknecht, et al. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, 2020, ICLR.
[24] Matthew J. Hausknecht, et al. Keep CALM and Explore: Language Models for Action Generation in Text-based Games, 2020, EMNLP.
[25] Yejin Choi, et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020, FINDINGS.
[26] D. Song, et al. Aligning AI With Shared Human Values, 2020, ICLR.
[27] Matthew J. Hausknecht, et al. How to Avoid Being Eaten by a Grue: Structured Exploration Strategies for Textual Worlds, 2020, ArXiv.
[28] Jianfeng Gao, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.
[29] Renuka Kumawat. The Third Pillar: How Markets and the State Leave the Community Behind, 2020.
[30] William L. Hamilton, et al. Learning Dynamic Belief Graphs to Generalize on Text-Based Games, 2020, NeurIPS.
[31] Matthew J. Hausknecht, et al. Graph Constrained Reinforcement Learning for Natural Language Action Spaces, 2020, ICLR.
[32] Matthew J. Hausknecht, et al. Interactive Fiction Games: A Colossal Adventure, 2019, AAAI.
[33] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[34] Matthew J. Hausknecht, et al. NAIL: A General Interactive Fiction Agent, 2019, ArXiv.
[35] Matthew J. Hausknecht, et al. TextWorld: A Learning Environment for Text-based Games, 2018, CGW@IJCAI.
[36] Shie Mannor, et al. Reward Constrained Policy Optimization, 2018, ICLR.
[37] Yuval Tassa, et al. Safe Exploration in Continuous Action Spaces, 2018, ArXiv.
[38] Ufuk Topcu, et al. Safe Reinforcement Learning via Shielding, 2017, AAAI.
[39] Pieter Abbeel, et al. Constrained Policy Optimization, 2017, ICML.
[40] Anca D. Dragan, et al. Cooperative Inverse Reinforcement Learning, 2016, NIPS.
[41] Thomas Hobbes, et al. Thomas Hobbes, 1981, A New Modern Philosophy.
[42] Mark O. Riedl, et al. Using Stories to Teach Human Values to Artificial Agents, 2016, AAAI Workshop: AI, Ethics, and Society.
[43] Jianfeng Gao, et al. Deep Reinforcement Learning with a Natural Language Action Space, 2015, ACL.
[44] Andrea Lockerd Thomaz, et al. Policy Shaping: Integrating Human Feedback with Reinforcement Learning, 2013, NIPS.
[45] Chris Arney. Antifragile: Things That Gain from Disorder, 2013.
[46] Katharina Pistor. A Legal Theory of Finance, 2013.
[47] Debraj Ray, et al. Linking Conflict to Inequality and Polarization, 2011.
[48] Linda D. Molm. The Structure of Reciprocity, 2010.
[49] Shimshon Bichler, et al. Capital as Power: A Study of Order and Creorder, 2009.
[50] K. Neckerman, et al. Inequality: Causes and Consequences, 2007.
[51] R. Dahl. The concept of power, 2007.
[52] Uri Gneezy, et al. Deception: The Role of Consequences, 2005.
[53] Mark W. Baldwin, et al. Relational schemas and the processing of social information, 1992.
[54] Dario Amodei, et al. Benchmarking Safe Exploration in Deep Reinforcement Learning, 2019.
[55] Susumu Cato, et al. Capital in the Twenty-First Century, 2016.
[56] Luke Muehlhauser, et al. Intelligence Explosion: Evidence and Import, 2012.
[57] C. Allen, et al. Stanford Encyclopedia of Philosophy, 2011.
[58] A. T. Parsons, et al. On the Concept of Political Power, 2008.
[59] I. Razafimahefa, et al. Purchasing Power Parity, 2007.
[60] Glossary of Industrial Organisation Economics and Competition Law, 1999.
[61] M. Castells. The Power of Identity, 1997.
[62] Charles Darwin, et al. On the origin of species, 1859, 1988.
[63] J. S. Dowker, et al. Fundamentals of Physics, 1970, Nature.
[64] J. R. French, et al. The bases of social power, 1959.
[65] A. Maslow. A Theory of Human Motivation, 1943.
[66] Bertrand Russell, et al. Power: A New Social Analysis, 1938.
[67] N. Pierce. Origin of Species, 1914, Nature.