Weakly Supervised Label Smoothing

We study Label Smoothing (LS), a widely used regularization technique, in the context of neural learning to rank (L2R) models. LS combines the ground-truth labels with a uniform distribution, encouraging the model to be less confident in its predictions. We analyze the relationship between the non-relevant documents (specifically, how they are sampled) and the effectiveness of LS, discussing how LS can capture "hidden similarity knowledge" between the relevant and non-relevant document classes. We further analyze LS by testing whether a curriculum-learning approach, i.e., starting with LS and switching to only ground-truth labels after a number of iterations, is beneficial. Inspired by our investigation of LS in the context of neural L2R models, we propose a novel technique called Weakly Supervised Label Smoothing (WSLS) that uses the retrieval scores of the sampled negative documents as a weak supervision signal when modifying the ground-truth labels. WSLS is simple to implement, requiring no modification to the neural ranker architecture. Our experiments across three retrieval tasks (passage retrieval, similar question retrieval, and conversation response ranking) show that WSLS for pointwise BERT-based rankers leads to consistent effectiveness gains. The source code is available at https://anonymous.4open.science/r/dac85d48-6f71-4261-a7d8-040da6021c52/.
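To make the label-modification step concrete, the following is a minimal sketch, not the authors' implementation, of standard LS alongside a WSLS-style variant for a pointwise binary ranker. The function names, the `alpha` mixing weight, and the min-max normalization of the weak-supervision (e.g., BM25) scores are assumptions of this sketch; see the released source code for the exact procedure.

```python
import numpy as np

def label_smoothing(y, alpha=0.1, num_classes=2):
    """Standard Label Smoothing: mix the one-hot ground truth
    with a uniform distribution over the classes."""
    return (1.0 - alpha) * y + alpha / num_classes

def weakly_supervised_label_smoothing(y, weak_scores, alpha=0.1):
    """Illustrative WSLS-style smoothing for a pointwise binary ranker.

    y           : one-hot ground-truth labels, shape (batch, 2),
                  column 0 = non-relevant, column 1 = relevant.
    weak_scores : retrieval scores (e.g., BM25) of the candidate
                  documents, shape (batch,), used as weak supervision.

    Instead of mixing with a uniform distribution, the ground truth is
    mixed with a distribution induced by the weak-supervision scores,
    so negatives ranked highly by the weak ranker get softer targets.
    """
    # Min-max normalize the weak scores to [0, 1] (an assumption of this sketch).
    s = (weak_scores - weak_scores.min()) / (weak_scores.max() - weak_scores.min() + 1e-8)
    # Per-example "weak" distribution over {non-relevant, relevant}.
    weak = np.stack([1.0 - s, s], axis=1)
    return (1.0 - alpha) * y + alpha * weak

# Toy usage: two sampled negatives and one positive with BM25-like scores.
y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = np.array([12.3, 2.1, 25.0])
print(weakly_supervised_label_smoothing(y, scores, alpha=0.2))
```

In this toy example, the high-scoring negative (12.3) receives a less confidently non-relevant target than the low-scoring one (2.1), which is the intuition behind using retrieval scores as weak supervision.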
