On the Theory of Weak Supervision for Information Retrieval

Neural network approaches have recently been shown to be effective in several information retrieval (IR) tasks. However, neural approaches often require large volumes of training data to perform well, and such data are not always available. To mitigate the shortage of labeled data, training neural IR models with weak supervision has recently been proposed and has received considerable attention in the literature. In weak supervision, an existing model automatically generates labels for a large set of unlabeled data, and a machine learning model is then trained on the generated "weak" data. Surprisingly, prior work has shown that the trained neural model can outperform the weak labeler by a significant margin. Although previous work has offered intuitive justifications for these improvements, the literature still lacks a theoretical explanation for the observed empirical findings. In this paper, we provide theoretical insight into weak supervision for information retrieval, focusing on learning to rank. We model the weak supervision signal as a noisy channel that introduces noise into the correct ranking. Based on the risk minimization framework, we prove that, under certain sufficient conditions on the loss function, weak supervision is equivalent to supervised learning under uniform noise. We also derive an upper bound on the empirical risk of weak supervision in the case of non-uniform noise. Following recent work on using multiple weak supervision signals to learn more accurate models, we establish an information-theoretic lower bound on the number of weak supervision signals required to guarantee an upper bound on the pairwise error probability. We empirically verify a set of these theoretical findings using both synthetic and real weak supervision data.
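The equivalence claimed above can be checked numerically. The sketch below (a minimal illustration, not the paper's experimental setup) models weak supervision as a noisy channel that flips each pairwise preference label independently with probability η, and uses the "unhinged" loss L(f, y) = 1 − y·f, a standard example of a symmetric loss satisfying L(f, +1) + L(f, −1) = C. For such a loss, the risk under uniform noise is an affine function of the clean risk, (1 − 2η)·R_clean + η·C, so both risks share the same minimizer; the specific score distribution and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.3   # uniform label-flip probability of the noisy channel (must be < 0.5)
C = 2.0     # symmetry constant: unhinged(f, +1) + unhinged(f, -1) = 2 for all f

def unhinged(f, y):
    # Unhinged loss L(f, y) = 1 - y*f, a symmetric (noise-tolerant) loss.
    return 1.0 - y * f

# Scores of a fixed ranking model on sampled document pairs (bounded in (-1, 1)),
# and clean pairwise preference labels correlated with those scores.
f = np.tanh(rng.normal(size=200_000))
y = np.sign(f + rng.normal(scale=0.5, size=f.size))
y[y == 0] = 1.0

# Noisy channel: flip each pairwise label independently with probability eta.
flip = rng.random(f.size) < eta
y_noisy = np.where(flip, -y, y)

clean_risk = unhinged(f, y).mean()
noisy_risk = unhinged(f, y_noisy).mean()
predicted = (1 - 2 * eta) * clean_risk + eta * C  # affine identity for symmetric losses

print(clean_risk, noisy_risk, predicted)  # noisy_risk matches predicted up to sampling error
```

Because the noisy risk is an increasing affine transform of the clean risk (for η < 1/2), minimizing it over models recovers the same ranker as minimizing the clean risk, which is the intuition behind the uniform-noise equivalence result.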
