Boosting Information Extraction through Semantic Technologies: The KIDs use case at CONSOB

In this paper we report on the initial results of a project on integrating Semantic Technologies with Information Extraction (IE) techniques, jointly carried out by Sapienza University of Rome and CONSOB (Commissione Nazionale per le Società e la Borsa), the Italian public authority responsible for regulating the securities market.

The use case. In the EU, the creators of financial products (a.k.a. financial manufacturers) are obliged by law to make information related to so-called PRIIPs (Packaged Retail and Insurance-based Investment Products) publicly available. The NCAs (National Competent Authorities) have supervisory duties over such products, so that they can be safely placed on the respective national markets. The legislation requires information about PRIIPs to be communicated to NCAs through documents called KIDs (Key Information Documents). In practice, this means that the features to be checked are cast into text reports, typically formatted as PDF files, and that extracting structured data from them (to bootstrap control activities) is left to the authority (in Italy, CONSOB). Due to the massive number of documents to be analyzed (e.g., about 700,000 KIDs received by CONSOB in 2019, and more than 1 million in 2020), this process cannot be carried out manually, yet to date it is only partially automated.

Objectives. Our main aim is thus to develop a solution that streamlines the extraction process and reduces as much as possible (ideally eliminates) the need for manual intervention, while still guaranteeing very high accuracy. At the same time, such a solution should return a data structure that properly accounts for the semantics of the business domain and is suited to rich and highly informative post-extraction analysis.

Solution.
Given the requirements highlighted above, the proposed solution constructs a Knowledge Graph (KG) whose intensional component (expressed in OWL) is designed with the help of domain experts, and whose extensional level is automatically created from KIDs through a rule-based IE mechanism. Structuring the extracted data as a KG not only facilitates integration with other corporate and external data, enabling rich analysis and management at an abstract, conceptual level, but also makes it possible to properly formalize the conceptual distinction between PRIIPs and the KIDs describing them, as well as the continuous updates to which KIDs are subject.
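
To make the idea of populating the extensional level through rules more concrete, the following is a minimal Python sketch, not the extraction mechanism actually used in the project. Everything in it is an illustrative assumption: the `http://example.org/kid#` namespace, the class and property names (`KID`, `PRIIP`, `describes`, `isin`, `riskIndicator`, `recommendedHoldingPeriodYears`), the regular expressions, and the sample text. Triples are represented as plain Python tuples rather than through an RDF library. Fields found in the text are attached to the KID node, and the product is a separate node linked via `describes`, so that several successive KIDs can refer to the same PRIIP.

```python
import re

# Hypothetical namespace and vocabulary, used only for illustration.
EX = "http://example.org/kid#"

# Hypothetical extraction rules: one regular expression per field of interest.
RULES = {
    "isin": re.compile(r"ISIN[:\s]+([A-Z]{2}[A-Z0-9]{9}[0-9])"),
    "riskIndicator": re.compile(r"risk indicator[^0-9]*([1-7])\b", re.IGNORECASE),
    "recommendedHoldingPeriodYears": re.compile(
        r"recommended holding period[:\s]+(\d+)\s+years?", re.IGNORECASE
    ),
}


def extract_triples(kid_id, text):
    """Apply every rule to the text of one KID and return (subject, predicate, object) triples."""
    kid = EX + "kid/" + kid_id
    triples = [(kid, "rdf:type", EX + "KID")]

    # Collect the first match of each rule, skipping rules that do not fire.
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)

    # The PRIIP is a distinct node, identified here (by assumption) through its ISIN.
    isin = fields.pop("isin", None)
    if isin is not None:
        priip = EX + "priip/" + isin
        triples.append((priip, "rdf:type", EX + "PRIIP"))
        triples.append((priip, EX + "isin", isin))
        triples.append((kid, EX + "describes", priip))

    # Remaining fields are statements made by this specific KID.
    for name, value in fields.items():
        triples.append((kid, EX + name, value))
    return triples


# Invented sample text standing in for the content of a KID.
SAMPLE = (
    "Product: Example Note. ISIN: IT0001234567. "
    "Summary risk indicator: 4 out of 7. "
    "Recommended holding period: 5 years."
)

triples = extract_triples("k1", SAMPLE)
for triple in triples:
    print(triple)
```

In a fuller setting the tuples would be loaded into a triple store and checked against the OWL ontology designed with the domain experts; the sketch only shows how rule matches can be turned into graph statements that keep KIDs and PRIIPs apart.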