Information Extraction and Automatic Markup for XML Documents

Although XML is becoming the standard document format, large amounts of text, written in the past as well as today, are still not available in this format. To exploit the benefits of XML, these legacy texts must be converted into XML. In this chapter, we discuss the issues involved in the automatic XML markup of documents. We survey existing approaches and describe one specific system in some detail.
