Automatic indexing of scanned documents: a layout-based approach

Archiving official written documents such as invoices, reminders and account statements is becoming increasingly important in both business and private settings. Creating appropriate index entries for document archives, such as the sender's name, creation date or document number, is tedious manual work. We present a novel approach to automatic document indexing based on generic positional extraction of index terms. For this purpose, we apply knowledge of document templates stored in a common full-text search index to find index positions that were successfully extracted in the past.
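The idea of reusing positions from previously indexed templates can be illustrated with a minimal sketch. It is not the authors' implementation: the `Word` and `Template` types, the Jaccard token overlap used in place of a real full-text search index, and the nearest-word lookup are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Word:
    """One OCR token with the (x, y) position of its bounding box (hypothetical type)."""
    text: str
    x: float
    y: float


@dataclass
class Template:
    """A previously indexed document (hypothetical type)."""
    words: list   # list of Word
    fields: dict  # index field name -> (x, y) where it was extracted before


def jaccard(a, b):
    """Token-set overlap; a stand-in for the score of a full-text search index."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def best_template(doc_words, templates):
    """Return the stored template whose vocabulary best matches the new document."""
    doc_tokens = [w.text.lower() for w in doc_words]
    return max(
        templates,
        key=lambda t: jaccard(doc_tokens, [w.text.lower() for w in t.words]),
        default=None,
    )


def extract_fields(doc_words, template):
    """For each known field, take the word closest to the stored position."""
    result = {}
    for name, (fx, fy) in template.fields.items():
        nearest = min(doc_words, key=lambda w: hypot(w.x - fx, w.y - fy))
        result[name] = nearest.text
    return result


# Illustrative data only: two stored templates and one newly scanned invoice.
invoice = Template(
    words=[Word("Invoice", 10, 10), Word("No.", 100, 10),
           Word("A-1001", 130, 10), Word("ACME", 10, 40)],
    fields={"document_number": (130, 10)},
)
statement = Template(
    words=[Word("Account", 10, 10), Word("Statement", 60, 10),
           Word("ACME", 10, 40), Word("2011-01-01", 200, 20)],
    fields={"creation_date": (200, 20)},
)
templates = [invoice, statement]

new_doc = [Word("Invoice", 11, 9), Word("No.", 101, 11),
           Word("B-2002", 131, 10), Word("ACME", 10, 41)]

matched = best_template(new_doc, templates)
result = extract_fields(new_doc, matched)
print(result)
```

In this sketch the new scan shares most of its tokens with the stored invoice, so the document number is read from the position where it was found in that earlier invoice. A production system would query a real search index and tolerate layout shifts, which this example does not attempt.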
