Automatic Hypertext Construction

The unprecedented growth of the World Wide Web illustrates the importance of hypertext as a method for organizing the rapidly expanding amount of on-line text. As document collections become larger and more dynamic, however, it is not feasible to construct more than an occasional hypertext manually. This thesis presents entirely automatic methods for gathering documents for a hypertext, linking them, and annotating those connections with a description of the type or nature of the link. The problem of automatically collecting related documents is addressed in Chapter 2, where robust Information Retrieval methods are applied to form high-quality links between documents. A local context check identifies links where ambiguous vocabulary erroneously suggests a relationship. Dynamic part retrieval is employed to select the portions of documents which are most related, allowing parts to be linked when it is more appropriate to link subtopics than entire documents. Chapter 3 presents a taxonomy of hypertext link types and defines the following three classes of links: ``pattern-matching'' links can be found using simple string-matching methods, ``manual'' links require substantial application of natural language understanding methods (which are currently beyond the state of the art), and ``automatic'' links are those which can be found using the methods of this thesis. Chapter 4 begins the work of automatic link typing by describing two novel graphical techniques for visualizing the relationship between two or more documents. ``Uniform'' visuals display the relationship between documents or document parts without regard to their relative sizes, whereas ``varying'' visuals include information about sizes and locations. Both methods highlight relationships between documents and motivate the automatic techniques of Chapter 5. Chapter 5 then demonstrates automatic methods for identifying the relationships depicted in the visualizations.
Using an approach based upon graph simplification, these methods automatically identify revision, summary, expansion, equivalence, comparison, contrast, tangential, and aggregate links. Chapter 6 discusses an informal evaluation of the link typing. Although the evaluation is somewhat inconclusive, it shows that automatic document linking performs well and indicates that much work remains to be done toward understanding automatic link typing.
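
The document-linking idea summarized for Chapter 2 (a global similarity test followed by a local context check) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the thesis's actual procedure: the raw term-frequency cosine measure, the tiny stopword list, sentence-level granularity for the local check, the thresholds `global_min` and `local_min`, and the function name `should_link` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "we", "is"}

def tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def should_link(doc_a, doc_b, global_min=0.3, local_min=0.5):
    """Link two documents only if they are similar overall AND at least
    one pair of sentences is similar in its local context."""
    if cosine(doc_a, doc_b) < global_min:
        return False
    best_local = max(
        (cosine(sa, sb) for sa in sentences(doc_a) for sb in sentences(doc_b)),
        default=0.0,
    )
    return best_local >= local_min

# Two texts that share only the ambiguous word "bank".
finance = "The bank raised rates. The bank cut loans. The bank hired staff."
river = "We walked along the bank. Fish swim near the bank. Reeds cover the bank."
print(should_link(finance, river))
```

In this sketch the two example texts pass the global test because the word "bank" dominates both, but no sentence pair shares more than that one word, so the local check rejects the link. This is the kind of vocabulary ambiguity the local context check is meant to catch.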