Semantic Clustering

Appropriate clustering of objects into pages in secondary memory is crucial to achieving good performance in a persistent object store. We present a new approach, termed semantic clustering, that exploits more of a program’s data accessing semantics than previous proposals. We insulate the source code from changes in clustering, so that clustering only impacts performance. The linguistic constructs used to specify semantic clustering are illustrated with an example of two tools with quite different access patterns. Experimentation with this example indicates that, for the tools, object sizes, and hardware configuration considered here, performing any clustering at all yields an order of magnitude improvement in overall tool execution time over pure page faulting, and that semantic clustering is faster than other forms of clustering by 20%–35%, and within 25% of the (unattainable) optimal clustering.

The most salient aspect of a tightly coupled persistent object store is that it blurs the distinction between data stored in main memory and data resident on secondary storage. Objects are accessed in a program using such an object store with little or no regard to where the object actually resides [Balch et al. 1989]. If in fact the object has not been cached in main memory, the first access to the object results in an object fault, in which the object is read in from disk and made available for access. Generally, objects are clustered on disk into segments, and an object fault transfers an entire segment from disk to main memory. We don’t consider here objects whose size is greater than the smallest segment, in part because such objects won’t benefit from any clustering scheme.

In this paper we present a new approach to clustering that exploits more of a program’s data accessing semantics than previous proposals.
This approach retains the user’s lack of concern for whether an object is cached in main memory, while significantly increasing the performance of the program by simultaneously reducing CPU overhead and disk I/O time.

The next section introduces the tradeoffs inherent in clustering and summarizes previous approaches. We present an overview of our approach, termed semantic clustering, in Section 2, with a detailed example appearing in Section 3. Section 4 presents the results of experiments that indicate several performance advantages to semantic clustering. The last section briefly examines how we plan to put this approach into practice in a fairly large programming environment.

1 Implementing Object Faulting

The data model supported by a persistent object store is a (potentially very large) collection of objects, each containing uninterpreted data along with references to other objects. Programs start with a designated root object, traverse some of the embedded references, and make changes to some of the objects encountered. When the program commits, all changes become visible to other programs that use the object store.

The runtime system is responsible for moving objects between main memory and secondary storage, and for converting between alternative representations. To the program, all objects are equally accessible; it is the runtime library’s responsibility to maintain this fiction in the presence of disparate main memory and disk access speeds.

There are three policies the runtime system must implement. First, how should objects be grouped into segments? Second, when should each object or segment be transferred to or from disk? And third, when should the representation of each object be converted from external form to internal form, and vice versa?
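
To make the data model and the first two policies concrete, the following is a minimal sketch in Python. It is not the implementation described in this paper: the names StoredObject, SegmentedStore, traverse, build_chain, and count_faults are hypothetical, the "disk" is an in-memory dictionary, there is no eviction or commit, and the third policy (conversion between external and internal representations) is omitted. It models only a collection of objects holding uninterpreted data and references, a traversal from a root object, and an object fault that transfers the whole segment containing the requested object.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical object: uninterpreted data plus references (object ids)."""
    oid: int
    data: bytes = b""
    refs: list = field(default_factory=list)


class SegmentedStore:
    """Toy store: 'disk' holds segments; an object fault loads a whole segment."""

    def __init__(self, objects, clustering):
        # clustering maps object id -> segment id (the grouping policy).
        self.segment_of = dict(clustering)
        self.disk = {}
        for obj in objects:
            self.disk.setdefault(self.segment_of[obj.oid], []).append(obj)
        self.cache = {}   # objects currently resident in main memory
        self.faults = 0   # number of segment transfers from disk

    def get(self, oid):
        if oid not in self.cache:
            # Object fault: transfer the entire containing segment.
            self.faults += 1
            for obj in self.disk[self.segment_of[oid]]:
                self.cache[obj.oid] = obj
        return self.cache[oid]


def traverse(store, root_oid):
    """Visit every object reachable from the root, depth-first."""
    seen, stack, order = set(), [root_oid], []
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        order.append(oid)
        stack.extend(reversed(store.get(oid).refs))
    return order


def build_chain(n):
    """Build a linked chain of objects 0 -> 1 -> ... -> n-1."""
    return [StoredObject(i, refs=[i + 1] if i + 1 < n else []) for i in range(n)]


def count_faults(clustering, n=8):
    """Traverse an n-object chain under the given clustering; return fault count."""
    store = SegmentedStore(build_chain(n), clustering)
    traverse(store, 0)
    return store.faults


if __name__ == "__main__":
    print("one object per segment:", count_faults({i: i for i in range(8)}))
    print("four objects per segment:", count_faults({i: i // 4 for i in range(8)}))
```

In this toy setting, placing each of the eight chained objects in its own segment incurs eight faults during the traversal, while grouping consecutive objects four to a segment incurs two. The program text of traverse is identical in both cases; only the clustering map changes, which is the sense in which clustering affects performance but not source code.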