DNA visual and analytic data mining

Describes data exploration techniques designed to classify DNA sequences. Several visualization and data mining techniques were used to validate and attempt to discover new methods for distinguishing coding DNA sequences (exons) from non-coding DNA sequences (introns). The goal of the data mining was to see whether some other, possibly non-linear combination of the fundamental position-dependent DNA nucleotide frequency values could be a better predictor than the AMI (average mutual information). We tried many different classification techniques including rule-based classifiers and neural networks. We also used visualization of both the original data and the results of the data mining to help verify patterns and to understand the distinction between the different types of data and classifications. In particular, the visualization helped us develop refinements to neural network classifiers, which have accuracies as high as any known method. Finally, we discuss the interactions between visualization and data mining and suggest an integrated approach.