Fundamental Experimental Research in Machine Learning

Fundamental research in machine learning is inherently empirical, because the performance of machine learning algorithms is determined by how well their underlying assumptions match the structure of the world. Hence, no amount of mathematical analysis can determine whether a machine learning algorithm will work well. Experimental studies are required.

To understand this point, consider the well-studied problem of supervised learning from examples. This problem is usually stated as follows. An example $x_i$ is an $n$-tuple drawn from some set $X$ according to some fixed, unknown probability distribution $D$. An unknown function $f$ is applied to each example to produce a label $y_i = f(x_i)$. The labels may be either real-valued quantities (in which case the problem is referred to as a regression problem) or discrete symbols (in which case the problem is referred to as a classification problem). The goal of machine learning algorithms is to construct an approximation $h$ to the unknown function $f$ such that, with high probability, a new example $x \in X$ drawn according to $D$ will be labeled correctly: $h(x) = f(x)$.

For example, consider the problem of diagnosing heart disease. The examples consist of features describing the patient, such as age, sex, whether the patient smokes, blood pressure, results of various laboratory tests, and so forth. The label indicates whether the patient was diagnosed with heart disease. The task of the learning algorithm is to learn a decision-making procedure that will make correct diagnoses for future patients (the first sketch below gives a toy rendering of this setup).

Learning algorithms work by searching some space of hypotheses, $H$, for the hypothesis $h$ that is "best" in some sense. Two fundamental questions of machine learning research are (a) what are good hypothesis spaces to search, and (b) what definitions of "best" should be used? For example, a very popular hypothesis space $H$ is the space of decision trees, and a popular definition of "best" is the hypothesis that minimizes the so-called pessimistic error estimate (Quinlan, 1993); a reconstruction of this estimate is sketched below.

It can be proved that if all unknown functions $f$ are equally likely, then all learning algorithms will have identical performance, regardless of which hypothesis space $H$ they search and which definition of "best" they employ (Wolpert, 1996; Schaffer, 1994). These so-called "no free lunch" theorems follow from the simple observation that the only information a learning algorithm has is the training examples, and the training examples do not provide any information about the labels of new points …
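
To make this setup concrete, here is a minimal sketch in Python of the supervised-learning protocol just described, with decision trees as the hypothesis space. The toy distribution $D$, the "true" function $f$, the stand-in patient features, and the use of scikit-learn's DecisionTreeClassifier are all illustrative assumptions, not details from the text.

```python
# A minimal sketch of the supervised-learning setup: draw examples from D,
# label them with an unknown f, search H (decision trees) for an h that
# approximates f. Everything concrete here is invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def draw_examples(m):
    """Draw m examples x_i ~ D; a toy stand-in for patient features:
    age (years), smoker (0/1), systolic blood pressure (mmHg)."""
    age = rng.uniform(20, 80, m)
    smoker = rng.integers(0, 2, m)
    bp = rng.normal(130, 20, m)
    return np.column_stack([age, smoker, bp])

def f(X):
    """A hypothetical 'true' labeling function: high risk if the patient
    is an older smoker or has very high blood pressure. In practice f is
    unknown to the learner."""
    return ((X[:, 0] > 55) & (X[:, 1] == 1) | (X[:, 2] > 160)).astype(int)

# Training set: examples x_i with labels y_i = f(x_i).
X_train = draw_examples(500)
y_train = f(X_train)

# The learner searches the hypothesis space H of decision trees for an h.
h = DecisionTreeClassifier().fit(X_train, y_train)

# Estimate Pr[h(x) = f(x)] on fresh examples drawn from the same D.
X_new = draw_examples(10_000)
print("estimated accuracy:", (h.predict(X_new) == f(X_new)).mean())
```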
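
The pessimistic error estimate mentioned above can be rendered as an upper confidence limit on a leaf's true error rate, which penalizes leaves that fit only a few training examples. The sketch below follows the usual description of C4.5's error-based pruning (Quinlan, 1993), using the normal-approximation (Wilson) upper bound at the commonly cited default confidence of 25%; the exact formula and constant are a reconstruction, not taken from this text.

```python
# A hedged sketch of a pessimistic error estimate in the spirit of C4.5's
# error-based pruning. The Wilson upper bound and the z value are a common
# reconstruction of Quinlan's estimate, not details from the source.
from math import sqrt

def pessimistic_error(errors: int, n: int, z: float = 0.674) -> float:
    """Upper confidence limit on the true error rate of a leaf that
    misclassifies `errors` of its `n` training examples. z = 0.674 is the
    one-sided 25% normal deviate, matching C4.5's default confidence."""
    f = errors / n
    num = f + z * z / (2 * n) + z * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

# A leaf that fits its 6 training examples perfectly still receives a
# nonzero pessimistic error; this is what lets pruning prefer simpler trees.
print(pessimistic_error(0, 6))   # ~0.07
print(pessimistic_error(2, 20))  # ~0.15
```

Roughly speaking, error-based pruning sums these estimates over a subtree's leaves and prunes the subtree to a single leaf whenever the collapsed leaf's estimate is no worse.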
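
Finally, the no-free-lunch claim can be checked by brute force on a tiny domain: if every Boolean labeling $f$ of four points is equally likely, then any fixed learner, whatever hypothesis space or notion of "best" it uses, labels the unseen points correctly exactly half the time on average. The enumeration below, including the particular learner, is an illustrative construction, not from the source.

```python
# A brute-force illustration of "no free lunch": average the off-training-set
# accuracy of one fixed learner over all equally likely targets f.
from itertools import product

domain = [0, 1, 2, 3]   # four points in X
train = [0, 1]          # points whose labels the learner sees
test = [2, 3]           # points whose labels it must predict

def learner(train_labels):
    """An arbitrary fixed learner: predict the majority training label,
    breaking ties in favor of 1. Any other deterministic rule would give
    the same average result."""
    guess = int(sum(train_labels) >= 1)
    return lambda x: guess

total = correct = 0
# Average over all 2^4 = 16 target functions f, each equally likely.
for f in product([0, 1], repeat=len(domain)):
    h = learner([f[x] for x in train])
    for x in test:
        total += 1
        correct += int(h(x) == f[x])

print(correct / total)  # 0.5: off-training-set accuracy is pure chance
```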