Mining the maintenance history of a legacy software system

A considerable amount of system maintenance experience can be found in bug tracking and source code configuration management systems. Data mining and machine learning techniques allow one to extract models from past experience that can be used in future predictions. By mining the software change record, one can therefore generate models that can be used in future maintenance activities. In this paper, we present an example of such a model that represents a relation between pairs of files and show how it can be extracted from the software update records of a real world legacy system. We show how different sources of data can be used to extract sets of features useful in describing this model, as well as how results are affected by these different feature sets and their combinations. Our best results were obtained from text-based features, i.e. those extracted from words in the problem reports as opposed to syntactic structures in the source code.