MarkItUp! An Incremental Approach to Document Structure Recognition

SUMMARY This paper presents MarkItUp!, a system for recognizing the structure of untagged electronic documents that contain sub‐documents of similar format. For such documents, manual structure recognition is a highly repetitive task, while specifying recognition grammars requires significant intellectual effort. Our approach uses manually structured examples to generate recognition grammars incrementally, using techniques for learning by example. Users structure example portions of a document by inserting mark‐ups. MarkItUp! then abstracts and unifies the structure of the examples and, on this basis, tries to structure another example of similar format. Users can correct or accept the structure it produces. With every accepted example, a grammar is thus acquired and gradually refined, which can then be used to structure the remaining portions of the document.
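
The mark-up, abstract, unify, and correct-or-accept cycle described above can be illustrated with a minimal sketch. It is not the MarkItUp! algorithm: the character-class abstraction, the unification by regular-expression alternation, the `Grammar` class, and the bibliographic sample entries are all assumptions chosen for illustration.

```python
import re

# Assumed coarse character classes used to generalize example text.
CLASSES = [r"[A-Za-z]+", r"\d+", r"\s+"]
TOKEN = re.compile(r"[A-Za-z]+|\d+|\s+|.", re.S)


def abstract(text):
    """Generalize a text fragment into a regex over coarse character classes."""
    parts = []
    for tok in TOKEN.findall(text):
        cls = next((c for c in CLASSES if re.fullmatch(c, tok)), None)
        # Anything outside the classes (punctuation etc.) is kept literally.
        parts.append(cls if cls else re.escape(tok))
    return "".join(parts)


class Grammar:
    """Hypothetical grammar: a fixed segment order, each with alternative patterns."""

    def __init__(self):
        self.tags = []  # tag of each segment; None marks a separator
        self.alts = []  # list of alternative patterns per segment

    def add_example(self, segments):
        """Abstract a marked-up example and unify it with the grammar so far."""
        tags = [tag for tag, _ in segments]
        if not self.tags:
            self.tags = tags
            self.alts = [[] for _ in segments]
        elif tags != self.tags:
            raise ValueError("example does not follow the known segment order")
        for alt, (_, text) in zip(self.alts, segments):
            pattern = abstract(text)
            if pattern not in alt:  # unify: add only unseen alternatives
                alt.append(pattern)

    def structure(self, text):
        """Try to structure an untagged string; return None if no parse exists."""
        regex = "".join("(" + "|".join(alt) + ")" for alt in self.alts)
        match = re.fullmatch(regex, text)
        if match is None:
            return None
        return [(tag, match.group(i + 1))
                for i, tag in enumerate(self.tags) if tag is not None]


grammar = Grammar()

# Step 1: the user marks up a first example by hand.
grammar.add_example([("author", "Smith"), (None, ", "), ("year", "1994"),
                     (None, ". "), ("title", "Parsing")])

# Step 2: the system tries the next entry; a two-word title is not yet covered.
first_try = grammar.structure("Miller, 1987. Document Grammars")

# Step 3: the user corrects the result, which refines the grammar.
grammar.add_example([("author", "Miller"), (None, ", "), ("year", "1987"),
                     (None, ". "), ("title", "Document Grammars")])

# Step 4: a further entry of similar format is now structured automatically.
second_try = grammar.structure("Jones, 2001. Tagged Text")
print(first_try)
print(second_try)
```

In this sketch each accepted or corrected example only ever adds alternatives, so the grammar grows monotonically; a real system would also need to generalize across alternatives and to handle examples whose segment order differs.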