What is (tagged) Text

In working on the New OED project, we, like many other researchers, have wrestled with large, intricate bodies of text. Based on this exposure, we have begun to investigate the similarities and differences between managing conventional business data and managing reference text data. The paper begins with the claim that text can support complex models of the real world that cannot be captured more formally. Thus important information resources must be held as text, but the very absence of a formal model makes it difficult to identify the structures present in a text. A common text structuring technique is descriptive markup, which introduces tags into a text stream. We present three views of tagged text: one based on tags as text, one on arbitrarily interleaved tags with text, and one on constrained tag placement in the text. Throughout the discussion, examples are drawn from our experience with the OED.

1. Text as a model

The role of a database is to model an enterprise, so that when queries are posed against the database, information can be obtained about the enterprise. Similarly a reference text is consulted to obtain information about aspects of our collective knowledge as modelled by its contents. A reference text database must capture the information of the reference materials, so that it can provide answers to queries for information about the same collective knowledge.

Unfortunately working with a reference text database is not as simple as working with a conventional database, because the content is not formally constrained: modelling with text does not distinguish which aspects of perceived reality are captured in the database and which are omitted [Kent78]. Whereas conventional database design begins with a business analysis to determine the users' requirements, followed by a synthesis of a model to capture all the relevant features in a highly structured form, text demands more editorial freedom. Consider the following two definitions from the OED2: