Learning Thematic Similarity Metric from Article Sections Using Triplet Networks

In this paper we propose leveraging the partition of articles into sections in order to learn a thematic similarity metric between sentences. We assume that a sentence is thematically closer to sentences within its own section than to sentences from other sections. Based on this assumption, we use Wikipedia articles to automatically create a large dataset of weakly labeled sentence triplets, each composed of a pivot sentence, one sentence from the same section, and one sentence from another section. We train a triplet network to embed sentences from the same section closer together. To test the performance of the learned embeddings, we create and release a sentence clustering benchmark. We show that the triplet network learns useful thematic metrics, which significantly outperform state-of-the-art semantic similarity methods and multipurpose embeddings on the task of thematic clustering of sentences. We also show that the learned embeddings perform well on the task of sentence semantic similarity prediction.
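The triplet construction described above can be illustrated with a minimal sketch. It assumes an article is already split into sections, each a list of sentences; the names `make_triplets` and `triplet_margin_loss` are hypothetical, and the hinge-style margin loss is a common choice for triplet networks rather than necessarily the exact objective used in the paper.

```python
import random
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (pivot, positive, negative)


def make_triplets(article: Dict[str, List[str]], rng: random.Random) -> List[Triplet]:
    """Build weakly labeled triplets from one article.

    `article` maps a section title to its sentences (an assumed input format).
    For each pivot sentence, the positive is drawn from the same section and
    the negative from any other section of the same article.
    """
    triplets: List[Triplet] = []
    titles = list(article)
    for title in titles:
        same = article[title]
        others = [s for t in titles if t != title for s in article[t]]
        # A pivot needs at least one section mate and one outside sentence.
        if len(same) < 2 or not others:
            continue
        for i, pivot in enumerate(same):
            positive = rng.choice(same[:i] + same[i + 1:])
            negative = rng.choice(others)
            triplets.append((pivot, positive, negative))
    return triplets


def triplet_margin_loss(d_pos: float, d_neg: float, margin: float = 1.0) -> float:
    """Hinge loss on embedding distances (assumed objective, not from the paper).

    It is zero once the negative is farther from the pivot than the positive
    by at least `margin`.
    """
    return max(0.0, d_pos - d_neg + margin)


# Toy article with two sections (invented sentences, for illustration only).
toy_article = {
    "History": ["The city was founded in 1200.", "It grew rapidly after 1800."],
    "Climate": ["Winters are mild.", "Summers are hot and dry."],
}
toy_triplets = make_triplets(toy_article, random.Random(0))
```

In a real pipeline the distances passed to the loss would come from a sentence encoder applied to the pivot, positive, and negative sentences; the sketch leaves that encoder out because the abstract does not specify it.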
