Multiple document summarization for written argumentative discourse

My dissertation addresses the problem of automatic summarization of multiple structured texts. I present an algorithm for creating discourse trees, implemented with TEI-encoded XML. These discourse trees are efficiently combined to form a summarization tree using hierarchical representations of text structure. This architecture makes intelligent text summarization possible. My summarization method is completely domain-independent, and it allows users to compare and contrast related texts. My text summarization approach can be embedded into XML-capable browsers, information retrieval systems, and information extraction systems to manage different classes of documents, different parts of documents, and different types of information contained in a document. My summarization architecture makes use of three theories of discourse structure to perform multiple document summarization. Rhetorical Structure Theory (RST) provides the concepts of a nucleus and a satellite to represent the important parts of a structured text. Much of the research on RST assumes that the text contained in the nucleus is more important than the text contained in the satellite. I use this inherent characteristic to make summarization decisions for argumentative text. Specifically, I map argumentative “objects” (proposition, claim, evidence, etc.) to the nuclei of RST, and I map argumentative “actions” (elaborates, supports, negates, etc.) to the satellites of RST. Consequently, I treat argumentative objects as more important to text summarization than argumentative actions. The theories presented in DMS provide the concept of discourse segment hierarchies. DMS outlines how each specific segment contributes to the overall purpose of the discourse. In my architecture, the purpose of an argument is straightforward (i.e., the author intends to persuade the reader to believe a given proposition).
DMS allows for a hierarchy of segment types within objects and actions to represent the given argument, and it outlines the theory for the relationships between different segment types. I use these hierarchical concepts and structural relationships to support my decision to use a tree as the correct representation of written argumentative discourse. Text types are also fundamental to my summarization approach. My algorithm makes the basic assumption that all writers use a particular schema when producing an argument. Text-type theory provides the supporting research for this assumption. In my architecture, I use a standard schema for written argumentative discourse to combine documents written by more than one author. I use text-type theory to define the overall structure of the schema, and I use argumentation theory to define the components of that schema. In my research, I investigated several architectural approaches and developed the following: (1) I created a summarization architecture using standardized TEI tags for a specific type of text. I relied on the structure of the underlying text instead of the grammar. Thus, my technique is applicable to other text types described by the TEI. (2) I combined knowledge of argumentative text types to create XML text trees with embedded TEI tags. I determined how to create, manipulate, and analyze the XML trees and demonstrated the flexibility of output available from these XML trees. (3) I combined argumentative text types, XML, and TEI tags to create an architectural model that uses industry standards.
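As an illustration only, the mapping of argumentative objects to nuclei and actions to satellites, and the merging of per-document trees under a shared schema, might be sketched as follows. This is a minimal sketch and not the dissertation's implementation: the element names (`argument`, `seg`, `group`, `summary`), the `role` and `source` attributes, the flat tree shape, and the `SCHEMA` ordering are all assumptions made for the example, and no claim is made that they match the TEI tags actually used.

```python
import xml.etree.ElementTree as ET

# Assumed segment types, taken from the examples named in the abstract.
OBJECTS = {"proposition", "claim", "evidence"}    # mapped to RST nuclei
ACTIONS = {"elaborates", "supports", "negates"}   # mapped to RST satellites

# Hypothetical schema order used to align segments across documents.
SCHEMA = ["proposition", "claim", "evidence",
          "elaborates", "supports", "negates"]


def role(segment_type):
    """Return the RST role assumed for an argumentative segment type."""
    if segment_type in OBJECTS:
        return "nucleus"
    if segment_type in ACTIONS:
        return "satellite"
    raise ValueError(f"unknown segment type: {segment_type}")


def build_tree(doc_id, segments):
    """Build a (flat, simplified) discourse tree for one document.

    segments is a list of (segment_type, text) pairs in document order.
    """
    root = ET.Element("argument", {"source": doc_id})
    for seg_type, text in segments:
        seg = ET.SubElement(
            root, "seg", {"type": seg_type, "role": role(seg_type)}
        )
        seg.text = text
    return root


def combine(trees):
    """Merge document trees into one summarization tree.

    Segments are grouped by type in SCHEMA order, and each copied
    segment records which document it came from.
    """
    summary = ET.Element("summary")
    for seg_type in SCHEMA:
        group = ET.SubElement(summary, "group", {"type": seg_type})
        for tree in trees:
            for seg in tree.iter("seg"):
                if seg.get("type") == seg_type:
                    attrs = dict(seg.attrib, source=tree.get("source"))
                    copy = ET.SubElement(group, "seg", attrs)
                    copy.text = seg.text
    return summary


def summarize(summary_tree):
    """Keep only nucleus segments, as (source, type, text) tuples."""
    return [
        (seg.get("source"), seg.get("type"), seg.text)
        for seg in summary_tree.iter("seg")
        if seg.get("role") == "nucleus"
    ]
```

Calling `summarize(combine([tree_a, tree_b]))` on two trees produced by `build_tree` yields the nucleus segments of both documents side by side, grouped by segment type, which is one simple way to support the compare-and-contrast use described above.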