From Legacy Documents to XML: A Conversion Framework

We present an integrated framework for converting documents from legacy formats to XML. We describe the LegDoC project, which aims to automate the conversion of layout annotations in layout-oriented formats such as PDF, PS, and HTML into semantic-oriented annotations. A toolkit of components covers complementary techniques: logical document analysis and semantic annotation using machine learning methods. We use a real conversion project as a running example to illustrate the techniques implemented in the project.
