Making Data Mining Models Useful to Model Non-paying Customers of Exchange Carriers

Because of both technological limitations and the nature of the problem, data mining may not solve a problem as completely as one would wish. When this happens, we first need to understand the actual business need and the characteristics of the available partial solution, and then find a compromise between the technical solution and the business need. The majority of papers published in data mining conferences and journals seem to concentrate only on the success side of the story. In this paper, we discuss our experiences and the complete process, from near failure to success, of applying inductive learning techniques to predict non-paying customers of competitive local exchange carriers (CLECs), where the non-payment rate is currently 20%. In our experiments, a number of state-of-the-art methods and algorithms labeled most customers as paying on time. Cost-sensitive learning is not possible because the target company cannot define a cost model. Finally, after discussions with the billing department, we arrived at a compromise that is still useful: predicting the probability that a customer will default. The billing team can use the predicted score to prioritize collection efforts as well as to predict cash flow. We have found that two randomized decision tree ensemble methods (Fan’s random decision tree and a probabilistic extension of Breiman’s random forest) are consistently more accurate in posterior probability estimation than probability calibration methods based on a single decision tree. The software, both Fan’s RDT and the probabilistic extension of random forest, as well as a longer version of this paper, will be made available by the contact author.
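As an illustration of how a predicted default probability could serve the billing team, the following is a minimal sketch and not the authors' software. It assumes that an ensemble's estimate for a customer is the plain average of the per-tree estimates, and that expected cash flow is the billed amount weighted by the probability of payment. The account identifiers, amounts, per-tree estimates, and the function name `ensemble_probability` are all hypothetical.

```python
from statistics import mean


def ensemble_probability(tree_estimates):
    """Average the per-tree P(default) estimates for one customer (assumed rule)."""
    return mean(tree_estimates)


# Hypothetical accounts: (customer id, amount billed, per-tree P(default) estimates).
accounts = [
    ("A", 120.0, [0.10, 0.20, 0.00]),
    ("B", 300.0, [0.60, 0.80, 0.70]),
    ("C", 80.0, [0.50, 0.30, 0.40]),
]

# One ensemble score per customer.
scored = [(cid, amount, ensemble_probability(est)) for cid, amount, est in accounts]

# Collection priority: customers with the highest predicted default probability first.
priority = [cid for cid, _, p in sorted(scored, key=lambda row: row[2], reverse=True)]

# Expected cash flow: each billed amount weighted by the probability of payment.
expected_cash = sum(amount * (1.0 - p) for _, amount, p in scored)

print(priority)
print(round(expected_cash, 2))
```

Ranking by probability alone is only one possible choice; a billing team might instead rank by the expected amount at risk (probability times amount billed), which needs no cost model beyond the invoice itself.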

[1] Bianca Zadrozny, et al. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, 2001, ICML.

[2] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[3] Philip S. Yu, et al. Is random model better? On its accuracy and efficiency, 2003, Third IEEE International Conference on Data Mining.