Projected clustering with subset selection

Clustering high-dimensional data has always been a major challenge because of the inherent sparsity of the data points. Our model uses attribute selection to handle the sparse structure of the data effectively. The subset selection is done by two different methods. In the first method, we use LASSO (Least Absolute Shrinkage and Selection Operator) to select the subset of most informative attributes, that is, those that preserve the cluster structure. Though there are other methods for attribute selection, LASSO has the distinctive property that it selects the most correlated set of attributes of the data. In the second method, we select a subset of linearly independent attributes using QR factorization. The model also identifies the dominant attributes of each cluster, which retain their predictive power as well. The use of LASSO also assures the quality of the projected clusters formed.
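The two selection routes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which response the LASSO is fitted against, so the sketch assumes a numeric target `y` (for example, a preliminary cluster indicator), and the function names `lasso_coordinate_descent` and `independent_columns`, the penalty `lam`, and the tolerance `tol` are all hypothetical. The LASSO is solved here by plain cyclic coordinate descent, and the QR route is approximated by a greedy column-pivoted Gram-Schmidt, which is the idea behind QR factorization with column pivoting.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.

    Assumption: y is whatever target the attributes should explain;
    the source abstract does not specify it.
    """
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y - X @ w
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            residual += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w


def independent_columns(X, tol=1e-8):
    """Pick linearly independent columns by greedy column pivoting.

    At each step the column with the largest remaining norm is chosen and
    its direction is projected out of all columns (pivoted Gram-Schmidt).
    """
    R = X.astype(float).copy()
    threshold = tol * np.linalg.norm(R, axis=0).max()
    selected = []
    for _ in range(min(X.shape)):
        norms = np.linalg.norm(R, axis=0)
        j = int(np.argmax(norms))
        if norms[j] <= threshold:
            break  # every remaining column depends on the selected ones
        selected.append(j)
        q = R[:, j] / norms[j]
        R -= np.outer(q, q @ R)
    return sorted(selected)


# Synthetic data (illustrative only): y depends on attributes 0 and 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.standard_normal(200)

w = lasso_coordinate_descent(X, y, lam=0.1)
lasso_subset = np.flatnonzero(np.abs(w) > 1e-10).tolist()

# Append a column that is an exact linear combination of two others.
X_dep = np.column_stack([X[:, :3], X[:, 0] + X[:, 1]])
qr_subset = independent_columns(X_dep)

print("LASSO-selected attributes:", lasso_subset)
print("Linearly independent attributes:", qr_subset)
```

In practice a library routine would likely replace both helpers; the hand-written versions are used here only to keep the sketch dependent on NumPy alone. Clustering would then be run on the columns in `lasso_subset` or `qr_subset` rather than on the full attribute set.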
