Large-scale multiple hypothesis testing in information retrieval: Towards a new approach to document ranking

Information retrieval (IR) may be considered an instance of a common modern statistical problem: a massive simultaneous hypothesis test. Such problems arise often in biostatistics, where plentiful data must be winnowed to identify a small number of potentially “interesting” cases. For instance, DNA microarray analysis requires researchers to filter thousands of genes in search of those implicated in a particular condition. This paper describes a novel approach to IR based on simultaneous hypothesis testing: a test is performed on each document, with the null hypothesis that the document is non-relevant. After a mathematical derivation of the proposed model, we compare its effectiveness on three standard data sets against two baseline IR systems, a vector space model and a language modeling-based system. These preliminary experiments show that the hypothesis testing approach to IR is not only philosophically appealing but also operates at the state of the art in effectiveness.
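To make the per-document testing idea concrete, the following is a minimal sketch and not the model derived in the paper. It assumes each document already has a p-value under the null hypothesis "this document is non-relevant" (the abstract does not say how such values are obtained), and it uses the Benjamini-Hochberg step-up procedure, a standard multiple-testing method chosen here purely for illustration. The function names, the document identifiers, and the level `q` are hypothetical.

```python
from typing import Dict, List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return one flag per p-value: True if its null hypothesis is rejected.

    Illustrative Benjamini-Hochberg step-up procedure at level q.
    """
    m = len(p_values)
    # Indices of the p-values, smallest p-value first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject the nulls of the cutoff smallest p-values.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


def retrieve(doc_p_values: Dict[str, float], q: float = 0.05) -> List[str]:
    """Return ids of documents whose "non-relevant" null is rejected,
    ordered from smallest to largest p-value (assumed inputs)."""
    doc_ids = list(doc_p_values)
    flags = benjamini_hochberg([doc_p_values[d] for d in doc_ids], q)
    kept = [d for d, flag in zip(doc_ids, flags) if flag]
    return sorted(kept, key=lambda d: doc_p_values[d])


if __name__ == "__main__":
    # Hypothetical p-values for four documents and one query.
    example = {"d1": 0.2, "d2": 0.001, "d3": 0.012, "d4": 0.9}
    print(retrieve(example, q=0.05))
```

In this sketch, sorting the retained documents by p-value yields a ranking, while the rejection threshold decides how many documents are returned at all. Whether the paper ranks documents this way is not stated in the abstract.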
