Towards Unsupervised Recognition of Semantic Differences in Related Documents

Automatically highlighting words that cause semantic differences between two documents could be useful for a wide range of applications. We formulate recognizing semantic differences (RSD) as a token-level regression task and study three unsupervised approaches that rely on a masked language model. To assess the approaches, we begin with basic English sentences and gradually move to more complex, cross-lingual document pairs. Our results show that an approach based on word alignment and sentence-level contrastive learning correlates robustly with gold labels. However, all unsupervised approaches still leave a large margin for improvement. Code to reproduce our experiments is available at https://github.com/ZurichNLP/recognizing-semantic-differences
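
The abstract does not specify how an alignment-based approach turns token representations into difference scores. The following is a minimal sketch of one plausible reading, assuming each token's score is one minus its highest cosine similarity to any token in the other document. The function names (`cosine`, `difference_scores`) and the toy three-dimensional embeddings are hypothetical illustrations, not the authors' implementation; in practice the embeddings would come from a masked language model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def difference_scores(emb_a, emb_b):
    """Score each token of document A by how poorly it aligns to document B.

    Assumption (not from the source): score = 1 - max cosine similarity
    to any token embedding in B. Higher means "more different".
    """
    scores = []
    for u in emb_a:
        best = max((cosine(u, v) for v in emb_b), default=0.0)
        scores.append(1.0 - best)
    return scores


# Toy, hand-made embeddings (hypothetical); real ones would be contextual.
tokens_a = ["the", "cat", "sleeps"]
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tokens_b = ["the", "cat", "runs"]
emb_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.6, 0.8]]

scores = difference_scores(emb_a, emb_b)
for token, score in zip(tokens_a, scores):
    print(f"{token}\t{score:.2f}")
```

Because the task is framed as token-level regression, such scores would be compared to gold labels by correlation rather than by thresholded accuracy.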
