Randomization Techniques for Data Mining Methods

Data mining research has concentrated on inventing novel methods for finding interesting information in large masses of data. This has indeed led to many new computational tasks and some interesting algorithmic developments. However, there has been less emphasis on testing the significance of the discovered patterns or models. We discuss the issues in testing the results of data mining methods, and review some of the recent work on scalable algorithmic techniques for randomization tests for data mining methods. We consider suitable null models and generation algorithms for randomization of 0-1 matrices, arbitrary real-valued matrices, and segmentations. We also discuss randomization for database queries.
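As a rough illustration of what a randomization test on a 0-1 matrix can look like, the sketch below assumes one particular null model that the abstract does not specify: random matrices with the same row and column sums as the data, generated by repeated 2x2 "checkerboard" swaps. The function names (`swap_randomize`, `cooccurrence`, `empirical_p_value`), the co-occurrence statistic, and the default parameter values are hypothetical choices made for this example only.

```python
import random


def swap_randomize(matrix, num_swaps, rng):
    """Return a copy of a 0-1 matrix after num_swaps attempted swaps.

    Assumed null model: row and column sums are kept fixed.
    """
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(num_swaps):
        r1, r2 = rng.randrange(n_rows), rng.randrange(n_rows)
        c1, c2 = rng.randrange(n_cols), rng.randrange(n_cols)
        # A swap applies only when the 2x2 submatrix has ones on one
        # diagonal and zeros on the other; flipping it keeps all margins.
        if (m[r1][c1] == 1 and m[r2][c2] == 1
                and m[r1][c2] == 0 and m[r2][c1] == 0):
            m[r1][c1] = m[r2][c2] = 0
            m[r1][c2] = m[r2][c1] = 1
    return m


def cooccurrence(matrix, a, b):
    """Example statistic: number of rows with a 1 in both columns a and b."""
    return sum(1 for row in matrix if row[a] == 1 and row[b] == 1)


def empirical_p_value(matrix, statistic, num_samples=200, num_swaps=1000, seed=0):
    """One-sided empirical p-value of statistic(matrix) under the swap null model."""
    rng = random.Random(seed)
    observed = statistic(matrix)
    at_least = sum(
        1 for _ in range(num_samples)
        if statistic(swap_randomize(matrix, num_swaps, rng)) >= observed
    )
    # Adding 1 to numerator and denominator counts the observed data itself,
    # so the estimate is never exactly zero.
    return (at_least + 1) / (num_samples + 1)
```

A call such as `empirical_p_value(data, lambda m: cooccurrence(m, 0, 1))` would then estimate how often randomized matrices with the same margins show at least as many co-occurrences of columns 0 and 1 as the original data. How many swaps are needed for the samples to be close to uniform is a separate question that this sketch does not address.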