On Extracting Structured Knowledge from Unstructured Business Documents

Efficient management of text data is a major concern for business organizations. To address it, we propose a novel approach for extracting structured knowledge from large corpora of unstructured business documents. This knowledge is represented as object instances, a common way of organizing the available information about entities, which we model here using document templates. The approach rests on the observation that a significant fraction of these documents are created using the cut-copy-paste method, so business document analysis should take this reuse into account. Accordingly, our approach solves object instance extraction in two steps: a similarity search that selects candidate documents, followed by extraction of object instances from the selected documents. Early qualitative results on a couple of carefully selected document corpora indicate that the approach is effective for this important component of the efficient text management problem.
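The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the template format, or the alignment method, so everything below is an assumption made for illustration. The sketch assumes a template written as plain text with slots such as `<CLIENT>`, uses bag-of-words cosine similarity for the search step, and uses token-level alignment from Python's `difflib` for the extraction step. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher

SLOT = re.compile(r"^<(\w+)>$")  # assumed slot syntax, e.g. <CLIENT>


def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of lowercased tokens."""
    ca = Counter(t.lower() for t in a_tokens)
    cb = Counter(t.lower() for t in b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def find_similar(template, corpus, threshold=0.4):
    """Step 1 (assumed form): keep documents that resemble the template."""
    t = template.split()
    return [doc for doc in corpus if cosine(t, doc.split()) >= threshold]


def extract_instance(template, document):
    """Step 2 (assumed form): align tokens and read slot values off the gaps.

    Only a replaced span that covers exactly one slot token is used;
    adjacent slots are ambiguous and are skipped in this sketch.
    """
    t, d = template.split(), document.split()
    matcher = SequenceMatcher(None, t, d, autojunk=False)
    instance = {}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and i2 - i1 == 1:
            slot = SLOT.match(t[i1])
            if slot:
                instance[slot.group(1)] = " ".join(d[j1:j2])
    return instance


# Hypothetical template and corpus, for illustration only.
template = "Agreement between <CLIENT> and <VENDOR> effective <DATE>"
corpus = [
    "Agreement between Acme Corp and Globex effective 2006-01-15",
    "Quarterly sales report for the northern region",
]

selected = find_similar(template, corpus)
instances = [extract_instance(template, doc) for doc in selected]
print(instances)
```

Separating the two steps keeps the expensive alignment off documents that do not resemble the template. A real system would need a more robust similarity measure and would have to handle edited boilerplate, which this sketch ignores.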
