Data Mining: Understanding Data and Disease Modeling

Analyzing large data sets requires a proper understanding of the data in advance. Such understanding helps domain experts to influence the data mining process and to properly evaluate the results of a data mining application. In this paper, we introduce an algorithm to identify anomalies in the data. We also propose an approach for incorporating the results of data characteristics checking into a data mining application. The application reported in this paper involves developing a disease model from gene expression data using machine learning techniques. We demonstrate how (i) simple models can be generated from a large set of attributes and (ii) the structure of the models changes when potentially anomalous cases are removed.