Table-processing paradigms: a research survey

Tables are a ubiquitous form of communication. While everyone seems to know what a table is, a precise, analytical definition of “tabularity” remains elusive because some bureaucratic forms, multicolumn text layouts, and schematic drawings share many characteristics of tables. There are significant differences between typeset tables, electronic files designed for display of tables, and tables in symbolic form intended for information retrieval. Most past research has addressed the extraction of low-level geometric information from raster images of tables scanned from printed documents, although there is growing interest in the processing of tables in electronic form as well. Recent research on table composition and table analysis has improved our understanding of the distinction between the logical and physical structures of tables, and has led to improved formalisms for modeling tables. This review, which is structured in terms of generalized paradigms for table processing, indicates that progress on half-a-dozen specific research issues would open the door to using existing paper and electronic tables for database update, tabular browsing, structured information retrieval through graphical and audio interfaces, multimedia table editing, and platform-independent display.
