A Position-Based Method for the Extraction of Financial Information in PDF Documents

Financial documents are omnipresent, and extracting, validating and exporting their content requires extensive human effort. Because such data is critical to effective business decisions, accuracy matters more than any attempt to accelerate the process or save resources. While many methods have been proposed in the literature, the problem of automatically extracting reliable financial data remains difficult to solve in practice and even more challenging to implement in a real-life context. This difficulty stems from the specific nature of financial text, where relevant information is contained principally in tables of varying formats. Table Extraction (TE), which identifies and decomposes table components, is considered an essential but difficult step in restructuring data into a usable format. In this paper, we present a novel method for extracting financial information by means of two simple heuristics. Our approach is based on the idea that, in unstructured but visually rich documents such as those in the Portable Document Format (PDF), the position of information is an indicator of semantic relatedness. This solution has been developed in partnership with the Caisse de dépôt et placement du Québec. We present our method and its evaluation on a corpus of 600 financial documents, on which it reaches an F-measure of 91%.
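
The abstract does not spell out the two heuristics, so the following is only a minimal sketch of the general idea that position can signal relatedness, not the authors' method. It assumes each text fragment comes with page coordinates, and treats fragments whose vertical positions are nearly equal as belonging to the same table row. The `Word` type, the `group_rows` function and the `y_tol` tolerance are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A text fragment with its position on the page (hypothetical type)."""
    text: str
    x: float  # horizontal position of the left edge
    y: float  # vertical position of the baseline


def group_rows(words, y_tol=2.0):
    """Group fragments into rows by vertical proximity.

    Assumption: fragments whose y differs from the first fragment of a row
    by at most y_tol belong to that row. Returns rows top to bottom, each
    ordered left to right.
    """
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [[w.text for w in sorted(r, key=lambda w: w.x)] for r in rows]


# Made-up sample fragments: a label and a value on each of two lines.
sample = [
    Word("Revenue", 50, 100.0),
    Word("1,200", 300, 100.5),
    Word("Net income", 50, 120.0),
    Word("340", 300, 119.8),
]

for row in group_rows(sample):
    print(row)
```

Under this assumption, a label and the figure printed on the same line end up paired even though nothing in the text stream links them. A real system would also need a horizontal criterion to separate columns and a way to handle multi-line cells.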
