Generalization of Document Structures and Document Assembly

The accelerating evolution of the World Wide Web has made numerous digital document collections widely available to the public. There is a clear need for new tools that help users gather, combine, and reuse information from existing document collections. At the same time, the number of fine-structured documents will increase enormously in the near future, since the Extensible Markup Language (XML) is rapidly gaining popularity in various communities. Compared to HTML, XML makes more versatile processing and customization of documents possible. However, explicit structuring using XML leads to heterogeneously structured document collections, which causes problems when combining and reusing fragments of documents.
Document assembly is the computer-aided construction of new documents from existing document collections. Such reuse includes finding relevant document fragments, modifying them as needed, and combining the fragments. This thesis describes a document assembly model based on versatile recognition and manipulation of document fragments, that is, coherent, contiguous, and relatively independent document parts used as the basis for new assemblies. We also introduce a general document assembly system, SAW, and a specialized system for tailoring textbooks via the Web.
If the assembled documents are to be further processed, the heterogeneous structures of the original documents also have to be unified. This work presents an element-type classification method that facilitates uniform processing of heterogeneous structures. The method contains a decision procedure for mapping an arbitrary structure element to a predefined generic class. The generic classes are defined in a Document Type Definition (DTD) called the generic DTD, which can be seen as a metastructure definition describing typical logical structures of electronic documents.
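As a rough illustration of the kind of decision procedure meant here, the following Python sketch gathers two simple statistics for each element type in a document instance (which child element types occur, and the average length of the text the element contains directly) and maps each type to a class. The class names, the threshold, and the sample document are hypothetical and chosen only for illustration; they are not the generic classes or rules defined in this thesis.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical threshold (in characters) separating short, title-like text
# from longer, paragraph-like text. Not taken from the thesis.
LONG_TEXT = 40


def collect_stats(xml_text):
    """Return {element type: (set of child types, average direct text length)}."""
    root = ET.fromstring(xml_text)
    children = defaultdict(set)
    lengths = defaultdict(list)
    for elem in root.iter():
        children[elem.tag].update(child.tag for child in elem)
        # Direct text: the element's own text plus the tails of its children.
        text = (elem.text or "") + "".join(child.tail or "" for child in elem)
        lengths[elem.tag].append(len(text.strip()))
    return {
        tag: (children[tag], sum(lengths[tag]) / len(lengths[tag]))
        for tag in lengths
    }


def classify(child_types, avg_len):
    """Toy decision procedure mapping an element type to an illustrative class."""
    if child_types and avg_len == 0:
        return "container"    # only wraps other elements
    if not child_types and avg_len >= LONG_TEXT:
        return "paragraph"    # leaf element with long text
    if not child_types:
        return "title-like"   # leaf element with short text
    return "mixed"            # both text and child elements


# Hypothetical sample document used only to exercise the sketch.
SAMPLE = """<doc>
  <sec><head>Intro</head><p>The Web has made many document collections widely available.</p></sec>
  <sec><head>Method</head><p>Elements are mapped to generic classes by a decision procedure.</p></sec>
</doc>"""

if __name__ == "__main__":
    for tag, (kids, avg) in collect_stats(SAMPLE).items():
        print(tag, sorted(kids), round(avg, 1), classify(kids, avg))
```

The sketch works on a single document instance and needs no DTD, which mirrors the idea of extracting information from instances rather than from declared structure. A realistic procedure would need more evidence than these two statistics.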
The element-type classification extracts information from document instances by inspecting element relations and average text lengths of element instances. In this way various structures, such as hierarchies and element containers wrapping logical units, can be recognized. The method is formally presented by using the concept of grammar morphism. Various practical examples of applying the generic classes are provided, and the method is applied to several well-known public document types. In addition to document assembly, the results of the element-type classification method can be used, for instance, in automatic generation of stylesheets for structured documents.

Acknowledgments

First of all, I am grateful to my supervisor, Professor Heikki Mannila, who encouraged me to continue my studies and who inspired and guided my work on this thesis. I am deeply grateful to my mentor, Professor Helena Ahonen. You …