Extraction of Structure and Content from the EDGAR Database: A Template-Based Approach

This paper presents a template-based approach to extracting data from the EDGAR database. A set of heuristic-based templates configures the trainable system so that each type of EDGAR filing is processed under a single configuration. This configurability is highly desirable because it makes the system expandable and flexible. The template-based approach also enables the system to extract both structural information and content from the filings in the EDGAR database. The ability to extract structural information from a section or from a complete filing makes it possible to collect data from real-world documents for users of financial data in both academia and industry. We use the income statement section of 10-K filings to illustrate the system and the use of the template-based approach.
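As a rough illustration of what a template-driven extractor for an income statement could look like, the following minimal Python sketch maps row-label patterns to field names and collects the numeric columns of each matching row. It is not the system described in this paper: the template contents, the field names, the functions `parse_amount` and `extract`, and the sample text are all hypothetical, and the sketch covers only content extraction from plain-text rows, not structural analysis.

```python
import re

# Hypothetical template: field name -> regex matched against the row label.
# A different filing type or section would get its own template dictionary.
INCOME_STATEMENT_TEMPLATE = {
    "revenue": r"^(total\s+)?(net\s+)?(revenues?|sales)\b",
    "net_income": r"^net\s+(income|earnings)\b",
}

# A numeric cell such as 10,500 or (300); parentheses denote a negative amount.
AMOUNT = re.compile(r"\(?\d[\d,]*(?:\.\d+)?\)?")


def parse_amount(token):
    """Convert a cell such as '1,200' or '(300)' to a float."""
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value


def extract(text, template):
    """Return {field: [amounts...]} for the first row matching each pattern."""
    result = {}
    for line in text.splitlines():
        label = line.strip().lower()
        for field, pattern in template.items():
            if field in result or not re.search(pattern, label):
                continue
            amounts = [parse_amount(t) for t in AMOUNT.findall(line)]
            if amounts:
                result[field] = amounts
    return result


# Made-up sample rows, for illustration only.
SAMPLE = """
CONSOLIDATED STATEMENTS OF INCOME
Net sales              10,500     9,800
Cost of sales           6,000     5,500
Net income (loss)       1,200      (300)
"""

if __name__ == "__main__":
    print(extract(SAMPLE, INCOME_STATEMENT_TEMPLATE))
```

Keeping the label patterns in a separate dictionary means that supporting another section would, under these assumptions, require only a new template rather than new parsing code.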
