Cost-constrained data acquisition for intelligent data preparation

Real-world data are noisy and often contain corrupted or incomplete values that can degrade the models built from them. To construct accurate predictive models, data acquisition is commonly used to prepare the data and fill in missing values. However, because acquisition is costly and the attributes in a data set are inherently correlated, acquiring correct information for every instance is both prohibitively expensive and unnecessary. An interesting and important problem therefore arises: which instances should be completed so that the model built from the processed data achieves the maximum performance improvement? The problem is complicated by the fact that attribute costs differ, so fixing the missing values of some attributes is inherently more expensive than fixing others. The question thus becomes: given a fixed budget, which instances should be selected for preparation so that a learner built from the processed data set maximizes its performance? In this paper, we propose a solution to this problem. The essential idea is to combine attribute costs with each attribute's relevance to the target concept, so that data acquisition favors attributes that are cheap to acquire yet informative for classification. To this end, we first introduce an economical factor (EF) that seamlessly integrates the cost and the classification importance of each attribute. We then propose a cost-constrained data acquisition model in which active learning, missing-value prediction, and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies on real-world data sets demonstrate the effectiveness of our method.
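The abstract does not give the exact form of the economical factor, but one natural reading is informativeness per unit acquisition cost, e.g. an attribute's information gain divided by its cost. The sketch below illustrates that reading; the formula, the function names, and the toy data are assumptions for illustration, not the paper's actual EF definition.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in label entropy after splitting on one attribute."""
    n = len(labels)
    # Partition the labels by the attribute's value.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def economical_factor(rows, labels, costs):
    """Hypothetical EF: informativeness per unit acquisition cost,
    used to favor attributes that are cheap but predictive."""
    return [information_gain(rows, labels, i) / costs[i]
            for i in range(len(costs))]

# Toy data: attribute 0 is cheap and perfectly predictive,
# attribute 1 is expensive and carries no class information.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
costs = [1.0, 5.0]
ef = economical_factor(rows, labels, costs)  # attribute 0 ranks first
```

Under this sketch, acquisition effort would be directed toward attributes with the highest EF, which matches the paper's stated goal of paying more attention to attributes that are cheap in price but informative for classification.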
