Forecasting Future World Events with Neural Networks

Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics, and economic indicators help shape policy and decision-making. In these domains, the judgment of human experts contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying news corpus. Questions are taken from forecasting tournaments, ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future). Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g., global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration. We test language models on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and with the incorporation of relevant information from the news corpus. In sum, Autocast poses a novel challenge for large language models, and improved performance could bring large practical benefits.
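
To illustrate why a date-organized corpus prevents leakage from the future, the following is a minimal sketch, not the Autocast implementation. The `Article` type, the `articles_available_on` helper, and the toy corpus are all hypothetical; the only idea taken from the abstract is that a model answering a question at a given date should see only news published before that date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    # Hypothetical record type; the real corpus schema may differ.
    published: date
    title: str


def articles_available_on(corpus, forecast_date):
    """Return articles published strictly before forecast_date, oldest first."""
    visible = [a for a in corpus if a.published < forecast_date]
    return sorted(visible, key=lambda a: a.published)


# Toy corpus (illustrative data only).
corpus = [
    Article(date(2021, 3, 1), "Early report"),
    Article(date(2021, 6, 15), "Mid-year update"),
    Article(date(2021, 9, 30), "Outcome announced"),
]

# A forecast made on 2021-07-01 must not see the September article.
visible = articles_available_on(corpus, date(2021, 7, 1))
print([a.title for a in visible])  # ['Early report', 'Mid-year update']
```

The strict inequality is a conservative choice in this sketch: an article dated the same day as the forecast is excluded, since it may have appeared after the forecast was made.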
