Improving document clustering using automated machine translation

With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language into many others. These translations provide different yet correlated views of the same set of documents. This raises an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering has answered this question positively. In this work, we propose an alternative approach that addresses the problem within the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement and efficient, and they consistently improve clustering on real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique requires neither the compatibility assumption nor the conditional independence assumption, and it does not involve subtle parameter tuning.
