Topic Models for Unsupervised Cluster Matching

We propose topic models for unsupervised cluster matching, which is the task of finding correspondences between clusters in different domains without correspondence information. For example, the proposed model finds correspondences between document clusters in English and German without alignment information such as dictionaries or parallel sentences/documents. The proposed model assumes that documents in all languages share a common latent topic structure, and that there is a potentially infinite number of topic proportion vectors in a latent topic space shared by all languages. Each document is generated using one of the topic proportion vectors and language-specific word distributions. By inferring the topic proportion vector used for each document, we can assign documents in different languages to common clusters, where each cluster is associated with a topic proportion vector. Documents assigned to the same cluster are considered to be matched. We develop an efficient inference procedure for the proposed model based on collapsed Gibbs sampling. The effectiveness of the proposed model is demonstrated on real data sets, including multilingual corpora of Wikipedia articles and product reviews.
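
The generative assumption described above can be illustrated with a minimal sketch. It is not the authors' implementation: the potentially infinite set of topic proportion vectors is replaced by a fixed finite number of clusters, all priors are taken to be symmetric Dirichlet distributions, and the names `sample_dirichlet`, `generate_corpora`, `n_clusters`, `n_topics`, and `doc_len` are hypothetical choices made for illustration only.

```python
import random


def sample_dirichlet(rng, dim, alpha=1.0):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpora(vocab_sizes, docs_per_lang, n_clusters=3, n_topics=4,
                     doc_len=20, seed=0):
    """Generate one toy corpus per language from a shared cluster structure.

    Assumption: a finite number of clusters stands in for the potentially
    infinite set of topic proportion vectors mentioned in the abstract.
    """
    rng = random.Random(seed)

    # Shared by all languages: cluster weights and one topic proportion
    # vector per cluster.
    cluster_weights = sample_dirichlet(rng, n_clusters)
    theta = [sample_dirichlet(rng, n_topics) for _ in range(n_clusters)]

    # Language-specific: one word distribution per topic and language.
    phi = [[sample_dirichlet(rng, v) for _ in range(n_topics)]
           for v in vocab_sizes]

    corpora = []
    for lang, vocab_size in enumerate(vocab_sizes):
        docs = []
        for _ in range(docs_per_lang):
            # Pick the topic proportion vector (cluster) for this document.
            cluster = rng.choices(range(n_clusters), weights=cluster_weights)[0]
            # Draw a topic for every token from that cluster's proportions.
            topics = rng.choices(range(n_topics), weights=theta[cluster],
                                 k=doc_len)
            # Draw each word from the language-specific topic distribution.
            words = [rng.choices(range(vocab_size), weights=phi[lang][k])[0]
                     for k in topics]
            docs.append((cluster, words))
        corpora.append(docs)
    return corpora


# Two languages with different vocabulary sizes, five documents each.
corpora = generate_corpora(vocab_sizes=[30, 40], docs_per_lang=5)
```

In this sketch each document is stored together with its cluster index, so documents in different languages that carry the same index are the ones the model would treat as matched. Inference runs in the opposite direction: the cluster indices are unobserved and must be recovered from the words alone.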
