Linear-Time Rule Induction

The recent emergence of data mining as a major application of machine learning has led to increased interest in fast rule induction algorithms. These must process large numbers of examples efficiently while still achieving good accuracy. If e is the number of examples, many rule learners have O(e^4) asymptotic time complexity in noisy domains, and C4.5RULES has been empirically observed to sometimes require O(e^3). Recent advances have brought this bound down to O(e log^2 e), while maintaining accuracy at the level of C4.5RULES. In this paper we present CWS, a new algorithm with guaranteed O(e) complexity, and verify that it outperforms C4.5RULES and CN2 in time, accuracy and output size on two large datasets. For example, on NASA's space shuttle database, running time is reduced from over a month (for C4.5RULES) to a few hours, with a slight gain in accuracy. CWS is based on interleaving the induction of all the rules and evaluating performance globally instead of locally (i.e., it uses a "conquering without separating" strategy as opposed to a "separate and conquer" one). Its bias is appropriate to domains where the underlying concept is simple and the data is plentiful but noisy.
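
To make the contrast with "separate and conquer" concrete, the following Python sketch illustrates the general idea of growing all rules in an interleaved fashion and scoring every candidate change by the accuracy of the whole rule set on the whole training set. It is a naive illustration only, not the CWS implementation: it recomputes everything from scratch at each step and so is not linear-time. The function names (covers, fit, predict, accuracy, induce), the representation of a rule as a set of (attribute index, value) tests, the use of a Laplace-corrected score to choose among matching rules, and the majority-class default are all assumptions made for this example.

```python
from collections import Counter

def covers(rule, x):
    # A rule is a frozenset of (attribute_index, value) equality tests.
    return all(x[attr] == val for attr, val in rule)

def fit(rules, data, n_classes):
    # Assumption: each rule predicts the majority class of the examples it
    # covers and is scored by a Laplace-corrected accuracy estimate.
    fitted = []
    for rule in rules:
        counts = Counter(y for x, y in data if covers(rule, x))
        if counts:
            cls, hits = counts.most_common(1)[0]
            score = (hits + 1) / (sum(counts.values()) + n_classes)
            fitted.append((rule, cls, score))
    return fitted

def predict(fitted, default, x):
    # The best-scoring matching rule wins; fall back to the default class.
    matching = [(score, cls) for rule, cls, score in fitted if covers(rule, x)]
    return max(matching, key=lambda m: m[0])[1] if matching else default

def accuracy(rules, data, n_classes, default):
    # Global evaluation: accuracy of the entire rule set on all examples.
    fitted = fit(rules, data, n_classes)
    return sum(predict(fitted, default, x) == y for x, y in data) / len(data)

def induce(data):
    n_classes = len({y for _, y in data})
    default = Counter(y for _, y in data).most_common(1)[0][0]
    conditions = sorted({(a, v) for x, _ in data for a, v in enumerate(x)})
    rules, best_acc = [], accuracy([], data, n_classes, default)
    while True:
        # Tentatively add one empty rule, then try specializing ANY rule.
        candidates = rules + [frozenset()]
        best = None
        for i, rule in enumerate(candidates):
            tested = {a for a, _ in rule}
            for cond in conditions:
                if cond[0] in tested:
                    continue
                trial = candidates[:i] + [rule | {cond}] + candidates[i + 1:]
                trial = [r for r in trial if r]  # drop an unused empty rule
                acc = accuracy(trial, data, n_classes, default)
                if acc > best_acc:
                    best_acc, best = acc, trial
        if best is None:  # no single change improves global accuracy
            return rules, default
        rules = best

# Toy data: the class equals attribute 0; attribute 1 is irrelevant.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
rules, default = induce(data)
```

On this toy data the sketch returns a single rule testing attribute 0 = 1, with class 0 as the default. No examples are ever removed from the training set: every candidate specialization, of any rule, is judged by its effect on the accuracy of the rule set as a whole.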