ClimateBert: A Pretrained Language Model for Climate-Related Text

In recent years, large pretrained language models (LMs) have revolutionized the field of natural language processing (NLP). However, while pretraining on general corpora works very well for common language, niche language has been observed to pose problems. In particular, climate-related texts include specific language that common LMs cannot represent accurately. We argue that this shortcoming of today's LMs limits the applicability of modern NLP to the broad field of processing climate-related texts. As a remedy, we propose CLIMATEBERT, a transformer-based language model that is further pretrained on over 1.6 million paragraphs of climate-related text, crawled from various sources such as common news, research articles, and corporate climate reporting. We find that CLIMATEBERT leads to a 46% improvement on a masked language model objective which, in turn, lowers error rates by 3.57% to 35.71% on various climate-related downstream tasks such as text classification, sentiment analysis, and fact-checking.
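The further pretraining described above relies on a masked language model (MLM) objective. As a minimal sketch of how BERT-style masking works (this is illustrative pure Python, not the authors' implementation; the function name and the 80/10/10 split follow the standard BERT recipe): roughly 15% of token positions are selected, and of those, 80% are replaced with a mask token, 10% with a random vocabulary token, and 10% are left unchanged, with the model trained to recover the original token at every selected position.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """BERT-style MLM masking.

    Selects ~mask_prob of positions; of those, 80% become mask_token,
    10% become a random token from vocab, and 10% stay unchanged.
    Returns (masked_tokens, labels), where labels[i] is the original
    token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict this original token
            r = rng.random()
            if r < 0.8:
                masked[i] = mask_token        # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(vocab) # 10%: replace with random token
            # else: 10%: keep the original token (but still predict it)
    return masked, labels

# Example on a climate-flavored sentence (hypothetical input):
tokens = "companies disclose climate transition risks in annual reports".split()
masked, labels = mask_tokens(tokens, vocab=tokens, seed=42)
```

Loss is computed only at the selected positions, which is what makes domain-adaptive pretraining on climate text shift the model toward climate vocabulary without labeled data.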
