Applying the Weak Learning Framework to Understand and Improve C4.5

There has long been a chasm between theoretical models of machine learning and practical machine learning algorithms. For instance, empirically successful algorithms such as C4.5 and backpropagation have not met the criteria of the PAC model and its variants. Conversely, the algorithms suggested by computational learning theory are usually too limited in various ways to find wide application. The theoretical status of decision tree learning algorithms is a case in point: while it has been proven that C4.5 (and all reasonable variants of it) fails to meet the PAC model criteria [2], other recently proposed decision tree algorithms that do have non-trivial performance guarantees unfortunately require membership queries [6, 13].

Two recent developments have narrowed this gap between theory and practice: not for the PAC model, but for the related model known as weak learning or boosting. First, an algorithm called Adaboost was proposed that meets the formal criteria of the boosting model and is also competitive in practice [10]. Second, the basic algorithms underlying the popular C4.5 and CART programs have also very recently been shown to meet the formal criteria of the boosting model [12]. Thus, it seems plausible that the weak learning framework may provide a setting for interaction between formal analysis and machine learning practice that is lacking in other theoretical models.

Our aim in this paper is to push this interaction further in light of these recent developments. In particular, we perform experiments suggested by the formal results for Adaboost and C4.5 within the weak learning framework. We concentrate on two particularly intriguing issues. First, the theoretical boosting results for top-down decision tree algorithms such as C4.5 [12] suggest that a new splitting criterion may result in trees that are smaller and more accurate than those obtained using the usual information gain. We confirm this suggestion experimentally. Second, a superficial interpretation of the theoretical results suggests that Adaboost should vastly outperform C4.5. This is not the case in practice, and we argue through experimental results that the theory must be understood in terms of a measure of a boosting algorithm's behavior called its advantage sequence. We compare the advantage sequences for C4.5 and Adaboost in a number of experiments. We find that these sequences have qualitatively different behavior that explains in large part the discrepancies between empirical performance and the theoretical results. Briefly, we find that although C4.5 and Adaboost are both boosting algorithms, Adaboost creates successively "harder" filtered distributions, while C4.5 creates successively "easier" ones, in a sense that will be made precise.
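To make the first issue concrete, the following sketch (not the code used in our experiments) compares three candidate splitting criteria for top-down tree induction: the usual information-gain (entropy) impurity of C4.5, CART's Gini index, and the square-root criterion G(q) = 2*sqrt(q*(1-q)) analyzed in [12], which we use here as an illustrative stand-in for the new splitting criterion mentioned above. Here q denotes the fraction of positive examples reaching a node, and the gain of a split is the usual weighted decrease in impurity.

    # Minimal sketch comparing splitting criteria (illustrative only;
    # not the paper's experimental code).  q is the fraction of positive
    # examples at a node.
    from math import sqrt, log2

    def G_entropy(q):
        # Binary entropy, the impurity behind C4.5's information gain.
        if q in (0.0, 1.0):
            return 0.0
        return -q * log2(q) - (1 - q) * log2(1 - q)

    def G_gini(q):
        # CART's Gini index, rescaled so that G(1/2) = 1 like the others.
        return 4 * q * (1 - q)

    def G_sqrt(q):
        # Square-root criterion from [12], used here as a stand-in
        # for the "new splitting criterion" discussed in the text.
        return 2 * sqrt(q * (1 - q))

    def split_gain(G, q, q_left, q_right, w_left):
        # Weighted decrease in impurity when a node with positive
        # fraction q is split into children with positive fractions
        # q_left and q_right; w_left is the fraction of examples
        # routed to the left child.
        return G(q) - (w_left * G(q_left) + (1 - w_left) * G(q_right))

    # Hypothetical split: 60% of examples go left with q_left = 0.2,
    # the rest go right with q_right = 0.95 (so the parent has q = 0.5).
    for name, G in [("entropy", G_entropy), ("gini", G_gini), ("sqrt", G_sqrt)]:
        print(name, split_gain(G, q=0.5, q_left=0.2, q_right=0.95, w_left=0.6))

The three criteria can rank the same candidate splits differently, which is why the choice of impurity function can change both the size and the accuracy of the resulting tree.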
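For the second issue, the following sketch shows the standard Adaboost reweighting rule [10] and how an advantage sequence can be read off from it: at round t, the advantage gamma_t = 1/2 - eps_t measures how far the weak hypothesis is from random guessing on the current filtered (reweighted) distribution. The weak_learner argument is a placeholder for any base learner returning a hypothesis h with h(x) in {-1, +1} (e.g., a decision stump); this is an illustrative sketch under those assumptions, not the experimental code used in the paper.

    # Minimal sketch of Adaboost's reweighting and the resulting
    # advantage sequence (illustrative only).
    import numpy as np

    def adaboost_advantages(X, y, weak_learner, rounds):
        y = np.asarray(y)                # labels assumed to be in {-1, +1}
        n = len(y)
        w = np.full(n, 1.0 / n)          # initial distribution D_1 is uniform
        advantages = []
        for t in range(rounds):
            h = weak_learner(X, y, w)            # hypothesis on current distribution
            preds = np.array([h(x) for x in X])
            eps = np.sum(w[preds != y])          # weighted error on D_t
            advantages.append(0.5 - eps)         # advantage gamma_t
            if eps == 0 or eps >= 0.5:
                break
            alpha = 0.5 * np.log((1 - eps) / eps)
            # Misclassified examples gain weight, so the next filtered
            # distribution D_{t+1} is "harder" for the weak learner.
            w = w * np.exp(-alpha * y * preds)
            w /= w.sum()
        return advantages

Because each round concentrates weight on the examples the previous weak hypotheses got wrong, the advantages gamma_t produced by Adaboost tend to shrink over rounds; the contrasting behavior of C4.5's implicit filtered distributions is what we examine experimentally.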