Improving Neural Cross-Lingual Abstractive Summarization via Employing Optimal Transport Distance for Knowledge Distillation

Current state-of-the-art cross-lingual summarization models employ a multi-task learning paradigm, which operates on a shared vocabulary module and relies on the self-attention mechanism to attend over tokens in the two languages. However, the correlation learned by self-attention is often loose and implicit, and it is inefficient at capturing crucial cross-lingual representations between languages. The problem worsens for language pairs with distinct morphological or structural features, which makes cross-lingual alignment more challenging and leads to a drop in performance. To overcome this problem, we propose a novel Knowledge-Distillation-based framework for Cross-Lingual Summarization, which seeks to explicitly construct cross-lingual correlation by distilling the knowledge of a monolingual summarization teacher into a cross-lingual summarization student. Since the representations of the teacher and the student lie in two different vector spaces, we further propose a Knowledge Distillation loss based on Sinkhorn Divergence, an Optimal Transport distance, to estimate the discrepancy between the teacher and student representations. Owing to the intuitive geometric nature of Sinkhorn Divergence, the student model can effectively learn to align its cross-lingual hidden states with the teacher's monolingual hidden states, leading to a strong correlation between distant languages. Experiments on cross-lingual summarization datasets for pairs of distant languages demonstrate that our method outperforms state-of-the-art models in both high-resource and low-resource settings.
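As a rough illustration of the distillation objective described above, the sketch below computes a debiased Sinkhorn Divergence between a student's cross-lingual hidden states and a teacher's monolingual hidden states and treats it as a knowledge-distillation loss. This is a minimal PyTorch sketch under our own assumptions (uniform weights over time steps, a squared-Euclidean cost, log-domain Sinkhorn iterations); function names such as sinkhorn_divergence and the tensor shapes are illustrative and not taken from the paper's implementation.

import math
import torch

def sinkhorn_cost(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized OT cost between point clouds x (n, d) and y (m, d)
    with uniform weights, computed via log-domain Sinkhorn iterations."""
    n, m = x.size(0), y.size(0)
    C = torch.cdist(x, y, p=2) ** 2           # squared-Euclidean cost matrix (n, m)
    log_a = torch.full((n,), -math.log(n))    # log of uniform source weights
    log_b = torch.full((m,), -math.log(m))    # log of uniform target weights
    f = torch.zeros(n)                        # dual potential for x
    g = torch.zeros(m)                        # dual potential for y
    for _ in range(n_iters):
        f = -eps * torch.logsumexp((g[None, :] - C) / eps + log_b[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - C) / eps + log_a[:, None], dim=0)
    # Recover the transport plan from the potentials and return <P, C>.
    P = torch.exp((f[:, None] + g[None, :] - C) / eps + log_a[:, None] + log_b[None, :])
    return torch.sum(P * C)

def sinkhorn_divergence(x, y, eps=0.1, n_iters=50):
    """Debiased Sinkhorn Divergence: OT_eps(x, y) - 0.5 OT_eps(x, x) - 0.5 OT_eps(y, y)."""
    return (sinkhorn_cost(x, y, eps, n_iters)
            - 0.5 * sinkhorn_cost(x, x, eps, n_iters)
            - 0.5 * sinkhorn_cost(y, y, eps, n_iters))

# Hypothetical usage: hidden states of the cross-lingual student and the
# monolingual teacher for one example (sequence length x hidden size).
student_h = torch.randn(35, 512, requires_grad=True)
teacher_h = torch.randn(40, 512)
kd_loss = sinkhorn_divergence(student_h, teacher_h)
kd_loss.backward()  # gradients flow back into the student's hidden states

In practice such a loss term would be added to the student's summarization objective with a weighting coefficient; the debiasing terms keep the divergence non-negative and zero only when the two sets of hidden states match.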
