Towards Theoretical Bounds for Resource-bounded Information Gathering for Correlation Clustering

Resource-bounded Information Gathering for Correlation Clustering deals with designing efficient methods for obtaining and incorporating information from external sources to improve the accuracy of clustering tasks. In this paper, we formulate the problem, state some specific goals, and lay the foundation for a better theoretical understanding of this framework. Under some simplifying assumptions, we address the challenging problem of analytically quantifying the effect of changing a single edge weight on the partitioning of the entire graph, thereby demonstrating a method for calculating the expected reduction in error. Our analysis of different query selection criteria provides a formal way of comparing heuristics, and we compare the results of this theoretical analysis with simulation results. We also estimate the probability of recovering the true partition under various query selection strategies for general random graphs, discuss some possible directions for approximation, and prove a related bound under certain assumptions. Finally, we describe some general techniques for efficiently querying and selecting nodes when expanding graphs.
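To make the central question concrete, the following is a minimal sketch of how changing one edge weight can alter the partition of an entire graph. It is not the paper's analysis: it assumes the common correlation clustering objective of maximizing the sum of signed edge weights inside clusters, solves it by brute force on a four-node toy graph, and uses hypothetical names (`partitions`, `score`, `best_partition`) and made-up weights chosen only for illustration.

```python
from itertools import combinations


def partitions(items):
    """Yield every set partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Place `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def score(partition, weights):
    """Assumed objective: sum of signed edge weights inside clusters."""
    total = 0.0
    for block in partition:
        for u, v in combinations(sorted(block), 2):
            total += weights.get((u, v), 0.0)
    return total


def best_partition(nodes, weights):
    """Brute-force search; only feasible for very small graphs."""
    best = max(partitions(nodes), key=lambda p: score(p, weights))
    return sorted(sorted(block) for block in best)


nodes = [0, 1, 2, 3]
# Illustrative weights: positive means "probably same cluster".
weights = {
    (0, 1): 1.0, (2, 3): 1.0,
    (1, 2): -0.5, (0, 2): -0.2, (0, 3): -0.2, (1, 3): -0.2,
}
before = best_partition(nodes, weights)

# Simulate one query to an external source that revises a single edge.
updated = dict(weights)
updated[(1, 2)] = 2.0
after = best_partition(nodes, updated)

print("before query:", before)
print("after query: ", after)
```

In this toy instance, revising the single edge (1, 2) merges two clusters into one, so nodes 0 and 3 change their relationship even though no edge touching both was queried. Exhaustive enumeration grows with the Bell numbers, which is one reason an analytical treatment of this effect is of interest.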
