Genetic Algorithms for Reformulation of Large-Scale KDD Problems with Many Irrelevant Attributes

This research applies genetic algorithm (GA) implementations of attribute selection, partitioning, and synthesis to large-scale data mining problems. Domain knowledge about these operators has been shown to reduce the number of fitness evaluations needed for candidate attributes. We report results on genetic optimization of attribute selection and describe current work on attribute partitioning, synthesis specifications, and the encoding of domain knowledge about operators in a fitness function. The approach aims to reduce overfitting in inductive learning and to produce more general genetic versions of existing search-based algorithms (wrappers) for KDD performance tuning [KS98, HG00]. Several GA implementations of alternative attribute synthesis algorithms are applied to concept learning problems in military and commercial KDD applications. One of these, Jenesis, is deployed on several network-of-workstations clusters. It is shown to achieve substantially higher test set accuracy than both unwrapped decision tree learning and search-based wrappers [KS98].
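
To make the idea of a genetic wrapper for attribute selection concrete, the following is a minimal sketch and not the Jenesis implementation. It assumes a bit-mask chromosome (one bit per attribute), tournament selection, single-point crossover, bit-flip mutation, and elitism. The fitness function is validation accuracy of a 1-nearest-neighbor learner, used here only as a simple stand-in for the decision tree inducer. All function names, parameter values, and the synthetic data set (two relevant attributes, eight irrelevant ones) are hypothetical.

```python
import random


def make_data(rng, n, n_relevant=2, n_irrelevant=8):
    """Synthetic concept: label is 1 iff the relevant attributes sum to > 0."""
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(n_relevant + n_irrelevant)]
        rows.append((x, int(sum(x[:n_relevant]) > 0)))
    return rows


def wrapper_fitness(mask, train, valid):
    """Validation accuracy of 1-NN restricted to the attributes selected by mask."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty attribute subset cannot classify anything
    correct = 0
    for x, y in valid:
        nearest = min(train, key=lambda t: sum((t[0][i] - x[i]) ** 2 for i in idx))
        correct += int(nearest[1] == y)
    return correct / len(valid)


def genetic_select(train, valid, n_attrs, rng, pop_size=16, generations=10, p_mut=0.05):
    """Search attribute subsets with a simple generational GA (assumed design)."""
    cache = {}

    def fitness(mask):
        key = tuple(mask)
        if key not in cache:  # avoid re-evaluating a subset already seen
            cache[key] = wrapper_fitness(mask, train, valid)
        return cache[key]

    def tournament(pop):
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    # Seed the population with the all-attributes mask (the unwrapped baseline).
    pop = [[1] * n_attrs]
    pop += [[rng.randint(0, 1) for _ in range(n_attrs)] for _ in range(pop_size - 1)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: the best mask always survives
        while len(nxt) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randrange(1, n_attrs)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ int(rng.random() < p_mut) for bit in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best, fitness(best)


rng = random.Random(0)
train, valid = make_data(rng, 60), make_data(rng, 40)
baseline = wrapper_fitness([1] * 10, train, valid)  # all attributes, no selection
mask, score = genetic_select(train, valid, 10, rng)
print("baseline accuracy:", baseline)
print("selected mask:", mask, "accuracy:", score)
```

Because the all-attributes mask is seeded into the initial population and the best individual is carried forward unchanged, the returned validation accuracy can never fall below the unwrapped baseline. Whether that gain transfers to a held-out test set is a separate question that this sketch does not address.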