Statistically sound pattern discovery

Pattern discovery is a core data mining activity. Initial approaches were dominated by the frequent pattern discovery paradigm -- only patterns that occur frequently in the data were explored. Having been thoroughly researched and its limitations now well understood, this paradigm is giving way to a new one, which can be called statistically sound pattern discovery. In this paradigm, the main impetus is to discover statistically significant patterns, which are unlikely to have occurred by chance and are likely to hold in future data. Thus, the new paradigm provides a strict control over false discoveries and overfitting. This tutorial covers both classic and cutting-edge research topics on pattern discovery combined to statistical significance testing. We start with an advanced introduction to the relevant forms of statistical significance testing, including different schools and alternative models, their underlying assumptions, practical issues, and limitations. We then discuss their application to data mining specific problems, including evaluation of nested patterns, the multiple testing problem, algorithmic strategies and real-world considerations. We present the current state-of-the art solutions and explore in detail how this approach to pattern discovery can deliver efficient and effective discovery of small sets of interesting patterns.