Semi-Supervised Clustering of XML Documents: Getting the Most from Structural Information

As document providers can express more contextualized and complex information, semi-structured documents are becoming a major source of information in many areas, e.g., in digital libraries, e-commerce or Web applications. A particular characteristic of such document collections is the existence of some structure or metadata along with the data. In this scenario, clustering methods that can take advantage of such structural information to better organize such collections are highly relevant. Semi-structured documents pose new challenges to document clustering methods, however, since it is not clear how this structural information can be used to improve the quality of the generated clustering models. On the other hand, recently there has a growing interest in the semi-supervised clustering task, in which a little amount of prior knowledge is provided to guide the algorithm to a better clustering model. A particular type of semi-supervision is in the form of user-provided constraints defined over pairs of objects, where each pair informs if its objects must be in the same or in different clusters. In this paper, we consider the problem of constrained clustering in documents that present some form of structural information. We consider the existence of a particular form of information to be clustered: textual documents that present a logical structure represented in XML format. We define and extend methods to improve the quality of clustering results by using such structural information to guide the execution of the constrained clustering algorithm. Experimental results on the OHSUMED document collection show the effectiveness of our approach.

[1]  Wesley W. Chu Medical Digital Library to Support Scenario Specific Information Retrieval , 2000, Kyoto International Conference on Digital Libraries.

[2]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[3]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[4]  Arindam Banerjee,et al.  Active Semi-Supervision for Pairwise Constrained Clustering , 2004, SDM.

[5]  S. S. Ravi,et al.  Clustering with Constraints: Feasibility Issues and the k-Means Algorithm , 2005, SDM.

[6]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[7]  Roger Barga,et al.  Proceedings of the 22nd International Conference on Data Engineering Workshops, ICDE 2006, 3-7 April 2006, Atlanta, GA, USA , 2006, ICDE Workshops.

[8]  Chris Buckley,et al.  OHSUMED: an interactive retrieval evaluation and new large test collection for research , 1994, SIGIR '94.

[9]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[10]  R. Mooney,et al.  Impact of Similarity Measures on Web-page Clustering , 2000 .

[11]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[12]  C. M. Sperberg-McQueen,et al.  eXtensible Markup Language (XML) 1.0 (Second Edition) , 2000 .

[13]  Hong Chang,et al.  Locally linear metric adaptation for semi-supervised clustering , 2004, ICML.

[14]  C. M. Sperberg-McQueen,et al.  Extensible Markup Language (XML) , 1997, World Wide Web J..

[15]  S. S. Ravi,et al.  Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results , 2005, PKDD.

[16]  Nizar Grira,et al.  Unsupervised and Semi-supervised Clustering : a Brief Survey ∗ , 2004 .

[17]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[18]  Claire Cardie,et al.  Clustering with Instance-Level Constraints , 2000, AAAI/IAAI.