Towards Theoretical Understanding of Weak Supervision for Information Retrieval

Neural network approaches have recently been shown to be effective for several information retrieval (IR) tasks. However, neural approaches often require large volumes of training data to perform well, and such data is not always available. To mitigate this shortage of labeled data, training neural IR models with weak supervision has recently been proposed and has received considerable attention in the literature. In weak supervision, an existing model automatically generates labels for a large set of unlabeled data, and a machine learning model is then trained on the generated "weak" data. Surprisingly, prior work has shown that the trained neural model can outperform the weak labeler by a significant margin. Although these improvements have been intuitively justified in previous work, the literature still lacks a theoretical justification for the observed empirical findings. In this position paper, we propose to study weak supervision theoretically, in particular for IR tasks such as learning to rank. We briefly review a set of our recent theoretical findings that shed light on learning from weakly supervised data, and provide guidelines on how to train learning-to-rank models with weak supervision.
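To make the training setup concrete, below is a minimal sketch of weakly supervised ranking: an unsupervised scorer (a toy term-overlap function standing in here for a BM25-style weak labeler) labels query-document pairs, and a small neural ranker is trained on those labels with a pairwise hinge loss. The features, model architecture, and hyperparameters are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of weak supervision for ranking. The weak labeler is a toy
# term-overlap scorer standing in for BM25; the ranker, its two features, and
# all hyperparameters are illustrative assumptions, not from the paper.
import torch
import torch.nn as nn

def weak_score(query: str, doc: str) -> float:
    """Toy unsupervised weak labeler: fraction of query terms found in the doc."""
    terms = query.lower().split()
    doc_terms = set(doc.lower().split())
    return sum(t in doc_terms for t in terms) / max(len(terms), 1)

def features(query: str, doc: str) -> torch.Tensor:
    """Hypothetical 2-d feature vector: query-term overlap and document length."""
    terms = query.lower().split()
    doc_terms = doc.lower().split()
    overlap = sum(t in set(doc_terms) for t in terms) / max(len(terms), 1)
    return torch.tensor([overlap, len(doc_terms) / 100.0])

ranker = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
optimizer = torch.optim.Adam(ranker.parameters(), lr=1e-3)

query = "neural information retrieval"
docs = [
    "neural models for information retrieval",
    "classical probabilistic models of retrieval",
    "a survey of convolutional networks for vision",
]

# Pairwise training: whenever the weak labeler strictly prefers one document
# over another, push the ranker to reproduce that preference (hinge loss).
for _ in range(100):
    for pos in docs:
        for neg in docs:
            if weak_score(query, pos) <= weak_score(query, neg):
                continue  # no weak preference of pos over neg; skip this pair
            margin = ranker(features(query, pos)) - ranker(features(query, neg))
            loss = torch.clamp(1.0 - margin, min=0.0).squeeze()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# The trained ranker now scores documents in line with the weak preferences.
print([round(ranker(features(query, d)).item(), 3) for d in docs])
```

The pairwise formulation reflects a finding reported in prior work on neural ranking with weak supervision: training on relative preferences derived from weak scores tends to be more robust to labeler noise than regressing to the weak scores themselves.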
