Optimal screening and discovery of sparse signals with applications to multistage high throughput studies

Summary. A common feature of large-scale scientific studies is that signals are sparse, and it is desirable to narrow the focus substantially to a much smaller subset in a sequential manner. We consider two related data screening problems: one is to find the smallest subset that contains virtually all signals, and the other is to find the largest subset that contains essentially only signals. These screening problems are closely connected to, but distinct from, the more conventional signal detection and multiple-testing problems. We develop phase transition diagrams to characterize the fundamental limits in simultaneous inference, and we derive data-driven screening procedures that control the error rates with near-optimality properties. Applications to multistage high throughput studies are discussed.
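To make the two screening goals concrete, the following is a minimal sketch on simulated data. It is not the data-driven procedure developed in the paper: it uses the true signal labels, which are unknown in practice, purely to show how the two target subsets differ. The normal-mixture simulation, the function names, the tolerance levels, and the terms "false discovery proportion" and "missed discovery proportion" are all assumptions introduced for illustration.

```python
import random


def simulate(m=2000, n_signals=40, mu=3.0, seed=0):
    """Hypothetical sparse setting: n_signals of m statistics have mean mu, the rest mean 0."""
    rng = random.Random(seed)
    labels = [i < n_signals for i in range(m)]
    stats = [rng.gauss(mu if is_signal else 0.0, 1.0) for is_signal in labels]
    return stats, labels


def screening_path(stats, labels):
    """For each subset size k (the k largest statistics), record two error proportions.

    fdp: fraction of the selected subset that consists of nulls.
    mdp: fraction of all signals left outside the selected subset.
    Both need the true labels, so this is an oracle illustration only.
    """
    order = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
    n_signals = sum(labels)
    path = []
    found = 0
    for k, i in enumerate(order, start=1):
        found += int(labels[i])
        fdp = (k - found) / k
        mdp = (n_signals - found) / max(n_signals, 1)
        path.append((k, fdp, mdp))
    return path


def smallest_set_containing_signals(path, eps):
    """Smallest k whose subset misses at most a fraction eps of the signals."""
    # mdp is non-increasing in k and equals 0 at k = m, so a match always exists.
    return next(k for k, _, mdp in path if mdp <= eps)


def largest_set_of_mostly_signals(path, alpha):
    """Largest k whose subset contains at most a fraction alpha of nulls (0 if none)."""
    sizes = [k for k, fdp, _ in path if fdp <= alpha]
    return max(sizes) if sizes else 0


stats, labels = simulate()
path = screening_path(stats, labels)
k_keep = smallest_set_containing_signals(path, eps=0.05)
k_pure = largest_set_of_mostly_signals(path, alpha=0.05)
print("smallest subset missing at most 5% of signals:", k_keep)
print("largest subset with at most 5% nulls:", k_pure)
```

The first subset is the natural candidate to carry forward to a later stage of a multistage study, since few signals are lost; the second is the natural candidate for reporting discoveries, since few nulls are included. Lowering `mu` in the simulation weakens the signals and makes the two goals harder to meet at the same time.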
