Paper information - Minimax Localization of Structural Information in Large Noisy Matrices

We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables, such as proteins and drugs, biological species and gene sequences, or malware and signatures, is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad-hoc methods are available for biclustering, theoretical analysis of the problem is largely non-existent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions:

1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, the size of the matrix, and the size of the bicluster of interest.

2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit.

3. We characterize the SNR required by several computationally tractable procedures for biclustering, including element-wise thresholding, column/row average thresholding, and a convex relaxation approach to sparse singular vector decomposition.
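To make two of the tractable procedures named in the abstract concrete, here is a minimal Python sketch of element-wise thresholding and row/column average thresholding. It is not the authors' implementation. The data model (a constant mean `mu` added to one block of an i.i.d. Gaussian noise matrix), the function names, the threshold `tau`, and the shortcut of keeping the `k` largest averages (which assumes the bicluster size is known) are all assumptions made for illustration.

```python
import random


def planted_matrix(n_rows, n_cols, rows, cols, mu, sigma, seed=0):
    """Assumed model: N(0, sigma^2) noise everywhere, plus mean mu on rows x cols."""
    rng = random.Random(seed)
    rows, cols = set(rows), set(cols)
    return [
        [(mu if i in rows and j in cols else 0.0) + rng.gauss(0.0, sigma)
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


def top_k_by_average(matrix, k, axis):
    """Indices of the k rows (axis=0) or columns (axis=1) with the largest mean.

    Keeping the top k is a simplification of thresholding the averages;
    it assumes the number of bicluster rows/columns is known.
    """
    lines = matrix if axis == 0 else list(zip(*matrix))
    means = [sum(line) / len(line) for line in lines]
    order = sorted(range(len(means)), key=lambda idx: means[idx], reverse=True)
    return sorted(order[:k])


def elementwise_threshold(matrix, tau):
    """Set of (row, column) positions whose entry exceeds the threshold tau."""
    return {
        (i, j)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value > tau
    }


# Hypothetical example: a 5 x 5 bicluster planted in a 40 x 40 matrix.
A = planted_matrix(40, 40, rows=range(5), cols=range(5), mu=10.0, sigma=1.0, seed=1)
print("rows by average:   ", top_k_by_average(A, k=5, axis=0))
print("columns by average:", top_k_by_average(A, k=5, axis=1))
print("entries above tau: ", len(elementwise_threshold(A, tau=5.0)))
```

The example uses a deliberately strong signal so that both procedures succeed. Lowering `mu` relative to `sigma` is a simple way to see where each one starts to fail, which is the regime the SNR characterization in the abstract addresses.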
