Detecting ESG topics using domain-specific language models and data augmentation approaches

Despite recent advances in deep learning-based language modelling, many natural language processing (NLP) tasks in the financial domain remain challenging because appropriately labelled data are scarce. Task performance can also be limited by differences in word distribution between the general corpora typically used to pre-train language models and financial corpora, which often exhibit specialized language and symbology. Here, we investigate two approaches that may help to mitigate these issues. First, we experiment with further language model pre-training using large amounts of in-domain data from business and financial news. Second, we apply data augmentation approaches to increase the size of our dataset for model fine-tuning. We report our findings on an Environmental, Social and Governance (ESG) controversies dataset and demonstrate that both approaches improve accuracy in classification tasks.
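As a rough illustration of the second approach, the sketch below shows how a label-preserving augmentation step can enlarge a small labelled dataset before fine-tuning. The abstract does not say which augmentation methods were used, so the random token dropout here is only a toy stand-in for a real augmenter, and the function names, example texts and labels are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def token_dropout(text: str, rng: random.Random, p: float = 0.1) -> str:
    """Toy augmenter (an assumption, not the paper's method):
    drop each whitespace-separated token with probability p."""
    tokens = text.split()
    kept = [t for t in tokens if rng.random() >= p]
    # Fall back to the original text rather than return an empty string.
    return " ".join(kept) if kept else text


def augment_dataset(
    examples: List[Example],
    augment: Callable[[str, random.Random], str],
    copies: int = 2,
    seed: int = 0,
) -> List[Example]:
    """Return the original examples followed by `copies` augmented
    variants of each one. Every variant keeps its source label."""
    rng = random.Random(seed)
    out = list(examples)
    for text, label in examples:
        for _ in range(copies):
            out.append((augment(text, rng), label))
    return out


# Hypothetical examples in the spirit of an ESG controversies dataset.
train = [
    ("Regulator fines company over wastewater discharge into river", "environment"),
    ("Board members resign amid accounting irregularities probe", "governance"),
]
augmented = augment_dataset(train, token_dropout, copies=2)
```

Any function with the same signature, for example a wrapper around a translation model, could be passed as `augment` in place of `token_dropout`. The design keeps the originals at the front of the list so the augmented variants can be sliced off when they should not leak into a validation split.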
