Semisupervised Wrapper Choice and Generation for Print-Oriented Documents

Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists, and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of normal system operation: new wrappers are generated online based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging data set composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, and patents. We also perform an extensive analysis of the crucial tradeoff between accuracy and automation level.
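The select-or-generate decision described above can be illustrated with a minimal sketch. It is not PATO's actual algorithm: the abstract does not specify how documents are matched to wrappers, so the per-source similarity scores, the two thresholds `accept` and `reject`, and all names below are hypothetical assumptions chosen only to show how an automation level might be made configurable.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Decision:
    action: str            # "use", "confirm", or "generate"
    source: Optional[str]  # best-matching known source, if one is proposed


def choose_wrapper(scores: Dict[str, float], accept: float, reject: float) -> Decision:
    """Pick an action for one document (hypothetical scoring, not from the paper).

    scores maps each known source to an assumed similarity in [0, 1].
    accept and reject are assumed thresholds with reject <= accept.
    """
    if reject > accept:
        raise ValueError("reject must not exceed accept")
    if not scores:
        # No wrappers exist yet: a new one has to be generated.
        return Decision("generate", None)
    best = max(scores, key=scores.get)
    best_score = scores[best]
    if best_score >= accept:
        # Confident match: apply the wrapper without operator involvement.
        return Decision("use", best)
    if best_score < reject:
        # No suitable wrapper: ask the operator to create one via the GUI.
        return Decision("generate", None)
    # Uncertain match: propose the best candidate for operator confirmation.
    return Decision("confirm", best)


if __name__ == "__main__":
    print(choose_wrapper({"supplier_a": 0.92, "supplier_b": 0.40}, accept=0.8, reject=0.3))
    print(choose_wrapper({"supplier_a": 0.55}, accept=0.8, reject=0.3))
    print(choose_wrapper({"supplier_a": 0.10}, accept=0.8, reject=0.3))
```

In this sketch, widening the gap between `reject` and `accept` routes more documents to the operator, trading automation for accuracy, while setting the two thresholds equal removes the confirmation step entirely.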
