A Visual Formatting Purpose Representation Language to Enhance Automated Document Classification, Retrieval, and Indexing

Electronic document collections containing documents in machine-readable form lend themselves to automated indexing and classification. In fact, in many cases the size of these collections renders human indexing infeasible. Yet current automated indexing mechanisms analyze only the textual content of documents and fall short of human indexing, which can utilize all of the information available in the document, such as its formatting. Recent attempts at improving machine indexing include efforts to incorporate the textual style of documents into the indexing process. Visual cues in documents impart important information to human readers about the document itself and about the purpose of specific segments of text; yet, when a document is reduced to its textual content alone for the purpose of indexing, these cues are lost. For example, the visual cues that authors use to distinguish anecdotal segments from crucial elements of the text are immediately apparent to the human eye, but they are unavailable to current forms of document analysis, even though a human analyzing the document would use this information. We present here a Purpose Encoding Document Abstraction Language (PEDAL) for expressing the purpose information contained in formatted text. PEDAL allows formatting to be translated to a uniform representation of its purpose so that the information can be used during document analysis, comparison, indexing, and retrieval.