Domain Adaption of Named Entity Recognition to Support Credit Risk Assessment

Risk assessment is a crucial activity for financial institutions because it helps them determine the amount of capital they should hold to ensure their stability. Flawed risk assessment models can return erroneous results that lead banks to misuse capital and, in the worst case, to collapse. Robust models need large amounts of data to return accurate predictions, and the source of that data is text-based financial documents. Currently, bank staff extract the relevant data by hand, but the task is expensive and time-consuming. This paper explores a machine learning approach to extracting credit risk attributes from financial documents, modelling the task as a named-entity recognition (NER) problem. Statistical approaches generally require labelled data to learn their models, but annotation is expensive and tedious. We propose a solution for domain adaption for NER based on out-of-domain data, coupled with a small amount of in-domain data. We also developed a financial NER dataset from publicly available financial documents.
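
As a rough illustration of the data setup described above, the following sketch combines a larger out-of-domain NER corpus with a small in-domain one before training. It is a minimal sketch under stated assumptions, not the method used in the paper: the toy sentences, the BIO tags, the function name build_training_set, and the in_domain_weight oversampling parameter are all hypothetical and chosen only for illustration.

```python
import random

# A sentence is a list of (token, tag) pairs using BIO-style NER tags.
# Both corpora are tiny, made-up stand-ins (assumption, not real data).
OUT_OF_DOMAIN = [
    [("Alice", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Acme", "B-ORG"), ("hired", "O"), ("Bob", "B-PER")],
    [("Rain", "O"), ("fell", "O"), ("in", "O"), ("Oslo", "B-LOC")],
]
IN_DOMAIN = [
    [("Acme", "B-ORG"), ("shall", "O"), ("repay", "O"), ("the", "O"), ("loan", "O")],
]


def build_training_set(out_domain, in_domain, in_domain_weight=1, seed=0):
    """Concatenate out-of-domain sentences with repeated in-domain ones.

    in_domain_weight (hypothetical parameter) repeats the scarce in-domain
    sentences so they are not swamped by the larger out-of-domain corpus.
    """
    if in_domain_weight < 1:
        raise ValueError("in_domain_weight must be at least 1")
    combined = list(out_domain) + list(in_domain) * in_domain_weight
    # Shuffle with a fixed seed so the ordering is reproducible.
    random.Random(seed).shuffle(combined)
    return combined


train = build_training_set(OUT_OF_DOMAIN, IN_DOMAIN, in_domain_weight=3)
print(len(train), "training sentences")
```

The combined list could then be passed to any sequence labeller. Repeating the in-domain sentences is only one simple way to balance the two sources, and other weighting schemes are equally plausible.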
