Bilingual Segmented Topic Model

This study proposes the bilingual segmented topic model (BiSTM), which hierarchically models documents by treating each document as a set of segments, e.g., sections. While previous bilingual topic models, such as bilingual latent Dirichlet allocation (BiLDA) (Mimno et al., 2009; Ni et al., 2009), consider only cross-lingual alignments between entire documents, the proposed model considers cross-lingual alignments between segments in addition to document-level alignments and assigns the same topic distribution to aligned segments. This study also presents a method for simultaneously inferring latent topics and segmentation boundaries, incorporating unsupervised topic segmentation (Du et al., 2013) into BiSTM. Experimental results show that the proposed model significantly outperforms BiLDA in terms of perplexity and demonstrates improved performance in translation pair extraction (up to +0.083 extraction accuracy).
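The segment-level sharing described above can be illustrated with a minimal generative sketch. It is not the authors' implementation: it substitutes plain Dirichlet draws for the hierarchical prior (which the abstract does not specify), assumes every segment has an aligned counterpart, and all names (`generate_document_pair`, `phi_src`, `phi_tgt`, `alpha`, `concentration`) are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document_pair(segment_lengths, phi_src, phi_tgt,
                           alpha=1.0, concentration=10.0, seed=0):
    """Generate one bilingual document pair, segment by segment.

    segment_lengths: list of (n_src_words, n_tgt_words), one per aligned pair.
    phi_src, phi_tgt: language-specific topic-word distributions (K rows each).
    alpha, concentration: illustrative hyperparameters (assumptions).
    """
    rng = random.Random(seed)
    num_topics = len(phi_src)

    # Document-level topic distribution, shared by both languages.
    theta_doc = sample_dirichlet([alpha] * num_topics, rng)

    segments = []
    for n_src, n_tgt in segment_lengths:
        # One segment-level topic distribution per ALIGNED segment pair,
        # centered on the document-level distribution (Dirichlet stand-in).
        nu = sample_dirichlet(
            [concentration * p + 0.01 for p in theta_doc], rng)

        # Both languages draw topics from the same nu, but emit words
        # from their own topic-word distributions.
        src = [sample_categorical(phi_src[sample_categorical(nu, rng)], rng)
               for _ in range(n_src)]
        tgt = [sample_categorical(phi_tgt[sample_categorical(nu, rng)], rng)
               for _ in range(n_tgt)]
        segments.append({"nu": nu, "src": src, "tgt": tgt})

    return theta_doc, segments


if __name__ == "__main__":
    # Two topics, three-word vocabularies in each language (toy values).
    phi_src = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
    phi_tgt = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    theta, segs = generate_document_pair([(4, 3), (2, 5)], phi_src, phi_tgt)
    for i, seg in enumerate(segs):
        print(i, seg["src"], seg["tgt"])
```

A document-level model such as BiLDA would correspond, in this sketch, to drawing topics for every word directly from `theta_doc`. The extra `nu` per aligned segment pair is what lets topic proportions vary across sections while remaining tied across languages.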

[1]  ChengXiang Zhai,et al.  Structural Topic Model for Latent Topical Structure Analysis , 2011, ACL.

[2]  Eric P. Xing,et al.  Symmetric Correspondence Topic Models for Multilingual Text Analysis , 2012, NIPS.

[3]  Helmut Schmid,et al.  Probabilistic part-of-speech tagging using decision trees , 1994.

[4]  Marie-Francine Moens,et al.  Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications , 2015, Inf. Process. Manag.

[5]  Sumit Negi,et al.  Mining bilingual topic hierarchies from unaligned text , 2011, IJCNLP.

[6]  Guillaume Wenzek,et al.  Trans-gram, Fast Cross-lingual Word-embeddings , 2015, EMNLP.

[7]  Jian Hu,et al.  Mining multilingual topics from wikipedia , 2009, WWW '09.

[8]  Marcus Hutter,et al.  A Bayesian Review of the Poisson-Dirichlet Process , 2010, ArXiv.

[9]  Jian Hu,et al.  Cross lingual text classification by mining multilingual topics from wikipedia , 2011, WSDM '11.

[10]  Hal Daumé,et al.  Extracting Multilingual Topics from Unaligned Comparable Corpora , 2010, ECIR.

[11]  ChengXiang Zhai,et al.  Cross-Lingual Latent Topic Extraction , 2010, ACL.

[12]  Eric P. Xing,et al.  HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation , 2007, NIPS.

[13]  Andrew McCallum,et al.  Polylingual Topic Models , 2009, EMNLP.

[14]  J. Pitman,et al.  The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator , 1997.

[15]  David R. Karger,et al.  Global Models of Document Structure using Latent Permutations , 2009, NAACL.

[16]  Lan Du,et al.  Topic Segmentation with a Structured Topic Model , 2013, NAACL.

[17]  Marie-Francine Moens,et al.  Identifying Word Translations from Comparable Corpora Using Latent Topic Models , 2011, ACL.

[18]  John C. Platt,et al.  Translingual Document Representations from Discriminative Projections , 2010, EMNLP.

[19]  Xiaodong Liu,et al.  Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus , 2013, CoNLL.

[20]  Marie-Francine Moens,et al.  Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora , 2013, Information Retrieval.

[21]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[22]  Huidong Jin,et al.  Modelling Sequential Text with an Adaptive Topic Model , 2012, EMNLP.

[23]  Huidong Jin,et al.  A segmented topic model based on the two-parameter Poisson-Dirichlet process , 2010, Machine Learning.

[24]  Yoshua Bengio,et al.  BilBOWA: Fast Bilingual Distributed Representations without Word Alignments , 2014, ICML.

[25]  Marie-Francine Moens,et al.  Knowledge Transfer across Multilingual Corpora via Latent Topics , 2011, PAKDD.

[26]  L. C. Hsu,et al.  A Unified Approach to Generalized Stirling Numbers , 1998.

[27]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[28]  David M. Blei,et al.  Multilingual Topic Models for Unaligned Text , 2009, UAI.

[29]  Lan Du,et al.  Sampling Table Configurations for the Hierarchical Poisson-Dirichlet Process , 2011, ECML/PKDD.

[30]  Eric P. Xing,et al.  BiTAM: Bilingual Topic AdMixture Models for Word Alignment , 2006, ACL.

[31]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[32]  Vladimir Eidelman,et al.  Polylingual Tree-Based Topic Models for Translation Domain Adaptation , 2014, ACL.

[33]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res.