Understanding the Working Sets of Data Mining Applications

Data mining applications discover useful information or patterns in large sets of data. Because they can be highly parallelizable and computationally intensive, data mining applications have the potential to take advantage of the large numbers of processors predicted for future multi-core systems. However, the performance of these applications on this emerging platform is likely to be impeded by their intensive memory usage. In addition to accessing memory frequently, some of these applications exhibit exceedingly large working set sizes. Storing these working sets on chip in their entirety may be prohibitively expensive or infeasible, as working set sizes continue to grow with problem size. Greater insight into the characteristics of these working sets is needed in order to identify alternatives to storing the entire working set on chip. In this paper, we examine the memory system characteristics of a set of applications from the MineBench data mining suite. We analyze these applications in an architecture-independent manner in order to better understand the composition of the data working set; in particular, we document the duration and frequency of active and idle periods for working set data. We find that working set data may be reused repeatedly throughout a program’s execution, but each use lasts for only a short period of time. The resulting long idle periods may enable techniques other than caching to deliver good memory performance. We show that for several of these applications, simple prefetching schemes may alleviate the need to cache the large working sets.
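
To make the notion of active and idle periods concrete, the following is a minimal sketch of how such periods might be extracted from a memory access trace. It is an illustration under stated assumptions, not the methodology used in this paper: the trace format (pairs of logical time and address), the function name `active_idle_periods`, the 64-byte block size, and the `idle_gap` threshold that separates one active period from the next are all hypothetical choices.

```python
from collections import defaultdict


def active_idle_periods(trace, block_size=64, idle_gap=100):
    """Group the accesses to each memory block into active periods.

    trace: iterable of (time, address) pairs in nondecreasing time order.
           The time unit is an assumption (e.g. a count of executed
           instructions or memory references), not a hardware cycle count.
    block_size, idle_gap: hypothetical parameters chosen for illustration.

    Returns (active, idle):
      active maps block -> list of [start, end] active periods
      idle   maps block -> list of idle durations between those periods
    """
    active = defaultdict(list)
    for time, address in trace:
        block = address // block_size
        periods = active[block]
        if periods and time - periods[-1][1] <= idle_gap:
            # Close enough to the previous access: extend the active period.
            periods[-1][1] = time
        else:
            # First access, or the block sat idle too long: start a new period.
            periods.append([time, time])

    # An idle period is the gap between consecutive active periods of a block.
    idle = {
        block: [nxt[0] - cur[1] for cur, nxt in zip(periods, periods[1:])]
        for block, periods in active.items()
    }
    return dict(active), idle


# Block 0 is touched briefly, left idle for a long time, then reused.
example_trace = [(0, 0), (5, 8), (10, 16), (20, 64), (1000, 0), (1004, 32)]
active, idle = active_idle_periods(example_trace)
```

In this toy trace, block 0 has two short active periods separated by one long idle period, which is the kind of reuse pattern the abstract describes. The result depends on the chosen `idle_gap`, so any real analysis would need to justify or sweep that threshold.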