Continuum directions for supervised dimension reduction

Abstract: Dimension reduction of multivariate data supervised by auxiliary information is considered. A sequence of basis vectors for dimension reduction is obtained as minimizers of a novel criterion. The proposed method is akin to continuum regression, and the resulting basis vectors are called continuum directions. In the presence of binary supervision data, these directions continuously bridge the principal component, mean difference and linear discriminant directions, thus ranging from unsupervised to fully supervised dimension reduction. High-dimensional asymptotic studies of continuum directions for binary supervision reveal several interesting facts. In particular, the conditions under which the sample continuum directions are inconsistent, yet still classify well, are specified. While the proposed method can be used directly for binary and multi-category classification, its generalizations to incorporate any form of auxiliary data are also presented. The proposed method is fast to compute, and its performance is better than or on par with that of more computationally intensive alternatives.
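To make the three directions named in the abstract concrete, the sketch below computes the first principal component direction, the mean difference direction and the linear discriminant direction for binary labels. It does not implement the paper's criterion, which the abstract does not state. The helper `ridge_direction` is a swapped-in, ridge-regularized discriminant used only to show how a single tuning parameter can move a direction between two of these endpoints. All function names are hypothetical, and the code assumes more samples than variables so that the within-class covariance is invertible.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)


def class_stats(X, y):
    """Return total covariance, pooled within-class covariance and mean difference.

    X is an (n, p) data matrix and y holds binary labels in {0, 1}.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    S_T = Xc.T @ Xc / n
    X0, X1 = X[y == 0], X[y == 1]
    R0 = X0 - X0.mean(axis=0)
    R1 = X1 - X1.mean(axis=0)
    S_W = (R0.T @ R0 + R1.T @ R1) / n
    d = X1.mean(axis=0) - X0.mean(axis=0)
    return S_T, S_W, d


def endpoint_directions(X, y):
    """Compute the PCA, mean difference and LDA directions (unit vectors)."""
    S_T, S_W, d = class_stats(X, y)
    # eigh returns eigenvalues in ascending order, so the last column
    # is the leading eigenvector of the total covariance.
    _, evecs = np.linalg.eigh(S_T)
    w_pca = evecs[:, -1]
    w_md = unit(d)
    w_lda = unit(np.linalg.solve(S_W, d))
    return w_pca, w_md, w_lda


def ridge_direction(X, y, alpha):
    """Illustrative assumption, not the paper's method.

    Returns the unit vector proportional to (S_W + alpha * I)^{-1} d.
    Small alpha approaches the LDA direction; large alpha approaches
    the mean difference direction.
    """
    _, S_W, d = class_stats(X, y)
    p = S_W.shape[0]
    return unit(np.linalg.solve(S_W + alpha * np.eye(p), d))
```

Calling `endpoint_directions(X, y)` on a labeled data matrix returns the three reference directions, and sweeping `alpha` in `ridge_direction` traces a path between two of them. Reaching the principal component direction requires the paper's own criterion, which this sketch does not attempt to reproduce.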
