Jura: Towards Automatic Compliance Assessment for Annual Reports of Listed Companies

The initial public offering (IPO) market in Hong Kong is consistently one of the largest in the world. As part of its regulatory responsibilities, Hong Kong Exchanges and Clearing Limited (HKEX) reviews the annual reports published by listed companies (issuers). The number of issuers has grown rapidly, reaching 2,538 as of the end of 2020. This growth makes it challenging to manually review these annual reports against the many diverse regulatory obligations (listing rules). We propose a system named Jura that uses machine learning methods to make annual report review more efficient. The system checks the compliance of an issuer's published information against the listing rules in four steps: panoptic document recognition, relevant passage location, fine-grained information extraction, and compliance assessment. This paper describes the passage location step in detail, explains why it is critical for speeding up compliance assessment, and discusses the challenges it raises. We argue that although a passage is a relatively independent unit, locating the relevant passages accurately requires combining it with document structure and contextual information. With the help of Jura, HKEX reports saving 80% of the time spent on reviewing issuers' annual reports.
