Detecting important subgroups and rare classes in numerical data sets

This thesis addresses two problems in machine learning. The first is the well-known subgroup discovery problem, in which the goal is to identify statistically interesting subgroups with respect to target attributes. Attributes must be discretized during the subgroup discovery process. We describe an algorithm for discretizing continuous target attributes that identifies patterns in the target data and uses them to select the discretization cutpoints. We use this algorithm in a new subgroup discovery method that employs a novel quality function to evaluate the interestingness of subgroups. Tests show that the discretization method leads to improved insight. The second is a new data mining problem, which we call the needles-in-haystack problem: identifying the members of a rare class of data given a single instance of that class. Members of the rare class, the needles, are hidden in a set of records, the haystack. The only information characterizing the rare class is a single instance of a needle. Members of the needle class are assumed to be similar to each other according to an unknown needle characterization. The goal is to find the needle records hidden in the haystack. This thesis describes an effective algorithm for that task.
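To make the needles-in-haystack setting concrete, the sketch below shows a naive baseline, not the algorithm developed in this thesis. It ranks haystack records by scaled Euclidean distance to the single known needle. The function name `rank_by_similarity` and the toy data are hypothetical. The baseline assumes that every attribute is equally relevant, which is exactly what cannot be taken for granted when the needle characterization is unknown.

```python
import math
import statistics


def rank_by_similarity(needle, haystack):
    """Return haystack indices ordered from most to least similar to the needle.

    Naive baseline: every attribute is weighted equally after scaling,
    which is an assumption made for illustration only.
    """
    n_attrs = len(needle)

    # Scale each attribute by its standard deviation in the haystack so that
    # attributes measured on large scales do not dominate the distance.
    scales = []
    for j in range(n_attrs):
        sd = statistics.pstdev([row[j] for row in haystack])
        scales.append(sd if sd > 0 else 1.0)

    def distance(row):
        return math.sqrt(
            sum(((row[j] - needle[j]) / scales[j]) ** 2 for j in range(n_attrs))
        )

    return sorted(range(len(haystack)), key=lambda i: distance(haystack[i]))


# Toy data (hypothetical): four records with two numerical attributes each.
haystack = [[1.0, 200.0], [5.0, 210.0], [1.1, 205.0], [9.0, 190.0]]
needle = [1.0, 204.0]
ranking = rank_by_similarity(needle, haystack)
```

Records at the front of `ranking` are the candidates such a baseline would flag as needles. A method that first had to infer which attributes characterize the needle class could order the same records quite differently.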