Modern methods for classification analysis involve processes for “learning” to correctly assign elements of a data set to classes. In many settings, the learning process is supervised; i.e., the classes to which the training data belong are known in advance. In many other settings, however, the classes are not known a priori, and an unsupervised learning process is necessary. We present a novel, two-stage unsupervised learning methodology for the classification problem. Stage one consists of a special clustering method based on an unconstrained quadratic optimization model that finds optimal classes for the data. Stage two uses enhanced mathematical programming models to classify the data into the optimal classes found during stage one. A significant advantage of our approach, as demonstrated by computational testing, is its ability to yield more meaningful classifications than previously achieved in a variety of settings. We report the outcome of training and testing our method on various data sets from the data mining literature, with specific applications in finance. The comparative results demonstrate the effectiveness and versatility of the approach, and its merit as a tool for modeling and solving practical problems.

Introduction

Many classification and discrimination analysis applications involve supervised learning, in which the training data are labeled with the appropriate class definition. In some settings, however, the class definition itself may be subjective or ambiguous. For example, bond rating agencies such as Moody’s and Standard & Poor’s each have a proprietary algorithm for rating bonds, which may result in different rating scales and, thus, different assessments of the risk of the same underlying bond. In such instances, it is unclear whether one class definition is better than another.
Furthermore, there is a certain amount of subjectivity in the class definition inasmuch as the “experts” evaluating the elements of the different classes may disagree on the relative importance of each of the attributes used as criteria for classification.

In order to overcome this problem, we propose a two-stage approach to the classification problem. The first stage clusters the data into “optimal” classes, and the second stage seeks to classify the data correctly into the optimal classes found in stage one. For the purpose of clustering the data, we use the method described in Kochenberger et al. (2005), which makes use of an unconstrained binary quadratic program (UBQP) for clique partitioning. A tabu search (TS) procedure from Glover et al. (1999) is used to solve the UBQP efficiently. The classification stage is carried out by a multi-hyperplane mixed integer programming formulation for discrimination analysis, similar to those described in Better et al. (2006).

The paper is organized as follows: Section 1 provides a brief description of our clustering algorithm; Section 2 describes a basic multi-hyperplane model for classifying data into two groups; Section 3 introduces two examples that use real data to illustrate our approach; and Section 4 summarizes our results and conclusions.
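
To make the stage-one idea concrete, the following minimal sketch casts a tiny clustering instance as an unconstrained binary quadratic program. It should not be read as the formulation of Kochenberger et al. (2005) or the tabu search of Glover et al. (1999). The similarity matrix `weights`, the penalty value, the function names `build_ubqp` and `solve_by_enumeration`, and the use of exhaustive enumeration in place of tabu search are all assumptions made for illustration. The binary variable for item i and cluster k equals 1 when the item is placed in that cluster, and the requirement that each item belongs to exactly one cluster is folded into the objective as a quadratic penalty, which is what makes the model unconstrained.

```python
import itertools

import numpy as np


def build_ubqp(weights, n_clusters, penalty):
    """Return a symmetric Q such that x^T Q x is the within-cluster weight
    minus penalty * sum_i (sum_k x_ik - 1)^2, up to the constant penalty * n.

    Variable index i * n_clusters + k stands for "item i is in cluster k".
    """
    n = weights.shape[0]
    Q = np.zeros((n * n_clusters, n * n_clusters))
    for i in range(n):
        for k in range(n_clusters):
            a = i * n_clusters + k
            # Linear part of the expanded penalty (x^2 = x for binary x).
            Q[a, a] += penalty
            # Quadratic part: discourage putting item i in two clusters.
            for l in range(k + 1, n_clusters):
                b = i * n_clusters + l
                Q[a, b] -= penalty
                Q[b, a] -= penalty
            # Reward (or charge) w_ij when items i and j share cluster k.
            for j in range(i + 1, n):
                b = j * n_clusters + k
                Q[a, b] += weights[i, j] / 2.0
                Q[b, a] += weights[i, j] / 2.0
    return Q


def solve_by_enumeration(Q):
    """Maximize x^T Q x over all binary vectors (tiny instances only).

    Assumption: this brute-force loop stands in for a tabu search.
    """
    best_x, best_val = None, -np.inf
    for bits in itertools.product((0, 1), repeat=Q.shape[0]):
        x = np.array(bits, dtype=float)
        val = x @ Q @ x
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


# Hypothetical similarities: items 0 and 1 are alike, as are items 2 and 3.
weights = np.array([
    [0.0, 3.0, -1.0, -1.0],
    [3.0, 0.0, -1.0, -1.0],
    [-1.0, -1.0, 0.0, 2.0],
    [-1.0, -1.0, 2.0, 0.0],
])
n_clusters = 2
Q = build_ubqp(weights, n_clusters, penalty=20.0)
x_best, value = solve_by_enumeration(Q)
assignment = x_best.reshape(weights.shape[0], n_clusters)
labels = assignment.argmax(axis=1)
print(labels, value)
```

The penalty must exceed any gain obtainable by violating the assignment requirement; here 20 is larger than the total positive weight that could be counted across both clusters, so the maximizer assigns every item to exactly one cluster. Enumeration is feasible only because this instance has eight binary variables; realistic instances need a heuristic such as the tabu search the text refers to.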
[1] Fred W. Glover, et al. Clustering of Microarray Data via Clique Partitioning. J. Comb. Optim., 2005.
[2] F. Glover, et al. Tabu Search with Critical Event Memory: An Enhanced Application for Binary Quadratic Programs. 1999.
[3] Rajeev Motwani, et al. Clique Partitions, Graph Compression and Speeding-up Algorithms. STOC '91, 1991.
[4] Toshiyuki Sueyoshi, et al. Extended DEA-Discriminant Analysis. Eur. J. Oper. Res., 2001.
[5] Fred Glover, et al. Improved Linear Programming Models for Discriminant Analysis. 1990.