Active Information Acquisition for Improved Clustering

Many datasets contain feature values that are missing but can be acquired at a cost. In this paper, we consider the clustering task for such datasets and address the problem of acquiring, in a cost-effective manner, the missing feature values that most improve clustering quality. Since acquiring all missing information may be unnecessarily expensive, we propose a framework that iteratively selects the feature values that yield the highest improvement in clustering quality per unit cost. The framework can be adapted to different clustering algorithms, and we illustrate it with two popular methods, K-Means and hierarchical agglomerative clustering. Experimental results on several datasets demonstrate that the proposed framework improves clustering accuracy over random acquisition. Additional experiments evaluate the framework under different cost structures and explore several alternative formulations of the acquisition strategy.
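The iterative selection idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `greedy_acquire`, `oracle`, `estimate_gain`, and `column_variance_gain` are hypothetical, missing values are assumed to be encoded as NaN, and the column-variance score is only a crude stand-in for the estimated improvement in clustering quality, which the abstract does not specify.

```python
import numpy as np

def column_variance_gain(X, i, j):
    """Placeholder gain (an assumption, not the paper's utility):
    variance of the observed values in column j."""
    observed = X[~np.isnan(X[:, j]), j]
    return float(observed.var()) if observed.size > 1 else 0.0

def greedy_acquire(X, costs, oracle, estimate_gain, budget):
    """Repeatedly acquire the missing entry with the highest estimated
    gain per unit cost until no affordable candidate remains.

    X             : 2-D float array, NaN marks a missing feature value
    costs         : array of the same shape, positive cost of each entry
    oracle        : callable (i, j) -> acquired value (hypothetical)
    estimate_gain : callable (X, i, j) -> estimated quality improvement
    budget        : total cost that may be spent
    """
    X = X.copy()
    spent = 0.0
    acquired = []
    while True:
        rows, cols = np.where(np.isnan(X))
        # Keep only missing entries that still fit within the budget.
        candidates = [(int(i), int(j)) for i, j in zip(rows, cols)
                      if spent + costs[i, j] <= budget]
        if not candidates:
            break
        # Score each candidate by estimated gain per unit cost.
        scores = [estimate_gain(X, i, j) / costs[i, j] for i, j in candidates]
        i, j = candidates[int(np.argmax(scores))]
        X[i, j] = oracle(i, j)      # pay for and fill in the value
        spent += float(costs[i, j])
        acquired.append((i, j))
    return X, acquired, spent
```

In a fuller version, `estimate_gain` would re-cluster (for example with K-Means) under plausible values of the missing entry and measure the change in a clustering quality score; the loop itself would stay the same.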
