PYTHIA -

Knowledge Discovery in Databases (KDD) is a new and evolving research area that attempts to solve the knowledge acquisition bottleneck by automatically acquiring knowledge hidden in the enormous amounts of data stored in real-life, operational databases. This methodology of inducing knowledge has been applied to a variety of domains where manual inspection was not feasible. In our research, we have adopted, appropriately modified, and applied this methodology to databases storing performance data related to the solution of scientific computing applications. It has been argued that scientific databases lend themselves to automatic machine inspection, since the stored information is of good quality, without missing values or inconsistent data. PYTHIA-II, the system presented in this paper, lets a knowledge engineer organize and store data by seamlessly integrating a powerful DBMS into the knowledge acquisition process. The selection of relevant data for pattern extraction is simplified by a flexible and comprehensive graphical user interface, which supports user interaction with a collection of data mining tools. The system uses the knowledge structures generated by the mining tools to build a knowledge base that supports the functionality of a recommender system for the scientific computing domain. In particular, this paper describes the application of the KDD process to a case study involving the evaluation of software for the solution of elliptic Partial Differential Equations.

1. INTRODUCTION

Knowledge Discovery in Databases (KDD) is an emerging, interdisciplinary field that seeks to uncover hidden information in large, real-life, operational database systems. The phase of the KDD methodology that has attracted the interest of the majority of researchers in this area is Data Mining. During this phase, a data mining algorithm is applied to a target set of data to uncover patterns that will be used in building a model of the underlying domain.
Three of the most important issues addressed by this research are the enormous amount of data that must be processed in a short period of time, the incomplete and "dirty" information contained in these systems, and the time-varying nature of the data. With respect to the first issue, researchers seek scalable knowledge discovery methodologies that can be implemented efficiently on parallel systems, as well as advanced techniques that can access data from permanent data stores in a flexible and optimal manner. Incomplete and "dirty" data should be handled by the mining algorithms themselves; time-varying data calls for incremental updates to the discovered knowledge.
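
The idea of incrementally updating discovered knowledge can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the method used by PYTHIA-II: the class name IncrementalSolverStats, the labels "elliptic-2d", "solver_a" and "solver_b", and the timings are all hypothetical. Each new performance record updates a running mean in place, so earlier records never need to be re-read.

```python
from collections import defaultdict


class IncrementalSolverStats:
    """Hypothetical store: running mean solve time per (problem_class, solver)."""

    def __init__(self):
        self._count = defaultdict(int)
        self._mean = defaultdict(float)

    def update(self, problem_class, solver, seconds):
        """Fold one new timing record into the stored summary."""
        key = (problem_class, solver)
        self._count[key] += 1
        # Incremental mean: only the previous mean and the count are needed.
        self._mean[key] += (seconds - self._mean[key]) / self._count[key]

    def best_solver(self, problem_class):
        """Return the solver with the lowest mean time, or None if unseen."""
        candidates = {
            solver: mean
            for (cls, solver), mean in self._mean.items()
            if cls == problem_class
        }
        if not candidates:
            return None
        return min(candidates, key=candidates.get)


# Illustrative records only; these are not measurements from the paper.
stats = IncrementalSolverStats()
stats.update("elliptic-2d", "solver_a", 1.0)
stats.update("elliptic-2d", "solver_b", 1.25)
stats.update("elliptic-2d", "solver_a", 2.0)  # a later, slower run for solver_a
print(stats.best_solver("elliptic-2d"))
```

In this sketch the recommendation changes as soon as the third record arrives, without reprocessing the first two. A real system would typically update richer knowledge structures (rules, trees, or profiles) rather than a single mean.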