Structure in on-line documents

We present a hierarchical approach for extracting homogeneous regions in on-line documents, addressing the identification and processing of ruled and unruled tables, text, and drawings. The on-line document is first segmented into regions containing only text strokes and regions containing both text and non-text strokes. Each text region is then classified as an unruled table or plain text. Stroke clustering is used to segment the non-text regions, and each non-text segment is classified as a drawing, ruled table, or underlined keyword using stroke properties. The individual regions are then processed, and the results are assembled to identify the structure of the on-line document.
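The hierarchy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Stroke` class, the length threshold in `is_text`, the straightness test, the bounding-box clustering rule, and all numeric parameters are hypothetical stand-ins chosen only to make the control flow concrete. For brevity, the sketch clusters the non-text strokes directly and omits the unruled-table versus plain-text decision.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Stroke:
    points: List[Point]  # pen positions between pen-down and pen-up

    def length(self) -> float:
        return sum(math.dist(a, b) for a, b in zip(self.points, self.points[1:]))

    def bbox(self) -> Tuple[float, float, float, float]:
        xs, ys = zip(*self.points)
        return min(xs), min(ys), max(xs), max(ys)

    def is_straight(self, tol: float = 0.95) -> bool:
        # Assumed test: end-to-end distance is close to the path length.
        n = self.length()
        return n > 0 and math.dist(self.points[0], self.points[-1]) / n >= tol


def is_text(stroke: Stroke, max_len: float = 40.0) -> bool:
    # Assumed heuristic: short strokes are treated as text.
    return stroke.length() <= max_len


def bbox_gap(a: Stroke, b: Stroke) -> float:
    ax0, ay0, ax1, ay1 = a.bbox()
    bx0, by0, bx1, by1 = b.bbox()
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)
    dy = max(by0 - ay1, ay0 - by1, 0.0)
    return math.hypot(dx, dy)


def cluster(strokes: List[Stroke], max_gap: float = 20.0) -> List[List[Stroke]]:
    # Single-link grouping: strokes whose bounding boxes lie within
    # max_gap of each other, directly or through a chain, share a cluster.
    unvisited = set(range(len(strokes)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        group, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if bbox_gap(strokes[i], strokes[j]) <= max_gap]
            unvisited.difference_update(near)
            group.extend(near)
            frontier.extend(near)
        clusters.append([strokes[k] for k in sorted(group)])
    return clusters


def classify_non_text(group: List[Stroke]) -> str:
    # Assumed rules based on simple stroke properties.
    if not all(s.is_straight() for s in group):
        return "drawing"
    if len(group) == 1:
        return "underlined keyword"
    horizontal = vertical = False
    for s in group:
        x0, y0, x1, y1 = s.bbox()
        if x1 - x0 >= y1 - y0:
            horizontal = True
        else:
            vertical = True
    return "ruled table" if horizontal and vertical else "drawing"


def analyze(strokes: List[Stroke]) -> Dict[str, list]:
    text = [s for s in strokes if is_text(s)]
    non_text = [s for s in strokes if not is_text(s)]
    regions = [(classify_non_text(g), g) for g in cluster(non_text)]
    return {"text": text, "non_text": regions}
```

Calling `analyze` on a list of `Stroke` objects returns the strokes treated as text together with one labeled region per non-text cluster. A real system would replace each hypothetical rule with features and classifiers tuned on actual pen data.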
