Unsupervised feature selection for large data sets

Abstract The last decade saw a considerable increase in the availability of data. Unfortunately, this increase was overshadowed by the technical difficulties that arise when analysing large data sets, including long processing times, large data-storage requirements, and other issues related to the analysis of high-dimensional data. Consequently, reducing the dimensionality of data sets with minimal information loss has become of interest to virtually every data scientist. Many feature selection algorithms have been introduced in the literature; however, they suffer from two main issues. First, the vast majority of these algorithms require labelled samples to learn from, and labelling a meaningful amount of data is often too expensive, particularly when dealing with large data sets. Second, these algorithms were not designed to cope with the volume of data available nowadays. This paper introduces a novel unsupervised feature selection algorithm designed specifically for large data sets. Our experiments demonstrate that our method outperforms competing approaches.
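The abstract states the goal, unsupervised feature selection at scale, without giving the algorithm itself. The snippet below is a minimal illustrative sketch of one common clustering-based route to the problem, not the method proposed in the paper: cluster the data with a scalable k-means variant, score each feature by how little it disperses within clusters relative to its total dispersion, and retain the highest-scoring features. The use of MiniBatchKMeans, the synthetic data, and the values of k and n_keep are all assumptions made for this example.

```python
# Illustrative sketch of clustering-based unsupervised feature selection.
# NOT the paper's algorithm: features whose within-cluster dispersion is
# small relative to their total dispersion are assumed to carry cluster
# structure and are kept. MiniBatchKMeans stands in for a scalable
# clusterer; k, n_keep and the synthetic data are assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Synthetic data: 2 informative features with two well-separated groups,
# plus 8 uniform noise features (assumption made for the example).
n = 10_000
informative = np.vstack([rng.normal(c, 0.5, size=(n // 2, 2)) for c in (0.0, 5.0)])
noise = rng.uniform(0, 5, size=(n, 8))
X = np.hstack([informative, noise])

k, n_keep = 2, 2  # number of clusters / features to retain (assumptions)
labels = MiniBatchKMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Per-feature within-cluster dispersion: sum over clusters of squared
# deviations from each cluster's mean, computed feature by feature.
within = np.zeros(X.shape[1])
for c in range(k):
    Xc = X[labels == c]
    within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)

# Features with a small within/total dispersion ratio separate the
# clusters best; noise features score near zero.
total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
scores = 1.0 - within / total
selected = np.argsort(scores)[::-1][:n_keep]
print("selected features:", sorted(selected))  # expected: [0, 1]
```

Scoring features by within-cluster versus total dispersion is one simple instance of the feature-weighting ideas from the k-means literature; on the synthetic data above it recovers the two informative features while discarding the noise.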
