An Unsupervised Feature Selection Framework for Social Media Data

The explosive usage of social media produces massive amount of unlabeled and high-dimensional data. Feature selection has been proven to be effective in dealing with high-dimensional data for efficient learning and data mining. Unsupervised feature selection remains a challenging task due to the absence of label information based on which feature relevance is often assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection, e.g., social media data is inherently linked, which makes invalid the independent and identically distributed assumption, bringing about new challenges to unsupervised feature selection algorithms. In this paper, we investigate a novel problem of feature selection for social media data in an unsupervised scenario. In particular, we analyze the differences between social media data and traditional attribute-value data, investigate how the relations extracted from linked data can be exploited to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We systematically design and conduct systemic experiments to evaluate the proposed framework on data sets from real-world social media Web sites. The empirical study demonstrates the effectiveness and potential of our proposed framework.

[1]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[2]  Deng Cai,et al.  Laplacian Score for Feature Selection , 2005, NIPS.

[3]  Feiping Nie,et al.  Efficient and Robust Feature Selection via Joint ℓ2, 1-Norms Minimization , 2010, NIPS.

[4]  Rynson W. H. Lau,et al.  Knowledge and Data Engineering for e-Learning Special Issue of IEEE Transactions on Knowledge and Data Engineering , 2008 .

[5]  Huan Liu,et al.  Feature Selection for Classification: A Review , 2014, Data Classification: Algorithms and Applications.

[6]  Volker Roth,et al.  Feature Selection in Clustering Problems , 2003, NIPS.

[7]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[8]  Zi Huang,et al.  Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence ℓ2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning , 2022 .

[9]  Ron Kohavi,et al.  Irrelevant Features and the Subset Selection Problem , 1994, ICML.

[10]  David G. Stork,et al.  Pattern Classification , 1973 .

[11]  P. Abbeel,et al.  Label and Link Prediction in Relational Data , 2003 .

[12]  Carla E. Brodley,et al.  Unsupervised Feature Selection Applied to Content-Based Retrieval of Lung Images , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Carla E. Brodley,et al.  Feature Selection for Unsupervised Learning , 2004, J. Mach. Learn. Res..

[14]  Huan Liu,et al.  Feature Selection for Clustering: A Review , 2018, Data Clustering: Algorithms and Applications.

[15]  Carla E. Brodley,et al.  Visualization and interactive feature selection for unsupervised data , 2000, KDD '00.

[16]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Carla E. Brodley,et al.  Feature Subset Selection and Order Identification for Unsupervised Learning , 2000, ICML.

[18]  Huan Liu,et al.  Feature Selection with Linked Data in Social Media , 2012, SDM.

[19]  Charles R. Johnson,et al.  Matrix analysis , 1985, Statistical Inference for Engineers and Data Scientists.

[20]  Huan Liu,et al.  Unsupervised feature selection for linked social media data , 2012, KDD.

[21]  Huan Liu,et al.  Relational learning via latent social dimensions , 2009, KDD.

[22]  Noah E. Friedkin,et al.  Network Studies of Social Influence , 1993 .

[23]  M E J Newman,et al.  Finding and evaluating community structure in networks. , 2003, Physical review. E, Statistical, nonlinear, and soft matter physics.

[24]  Mark A. Hall,et al.  Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning , 1999, ICML.

[25]  Deng Cai,et al.  Unsupervised feature selection for multi-cluster data , 2010, KDD.

[26]  Huan Liu,et al.  Spectral feature selection for supervised and unsupervised learning , 2007, ICML '07.

[27]  Lior Wolf,et al.  Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weighted-based approach , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[28]  Chris H. Q. Ding,et al.  R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization , 2006, ICML.

[29]  Ben Taskar,et al.  Link Prediction in Relational Data , 2003, NIPS.

[30]  Jieping Ye,et al.  Multi-Task Feature Learning Via Efficient l2, 1-Norm Minimization , 2009, UAI.

[31]  Filippo Menczer,et al.  Feature selection in unsupervised learning via evolutionary search , 2000, KDD '00.

[32]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[33]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[34]  Massimiliano Pontil,et al.  Multi-Task Feature Learning , 2006, NIPS.

[35]  Aixia Guo,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2014 .

[36]  Lei Wang,et al.  Efficient Spectral Feature Selection with Minimum Redundancy , 2010, AAAI.

[37]  Jiawei Han,et al.  Towards feature selection in network , 2011, CIKM '11.

[38]  Aristidis Likas,et al.  Bayesian feature and model selection for Gaussian mixture models , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  M. McPherson,et al.  Birds of a Feather: Homophily in Social Networks , 2001 .

[40]  J. Lafferty,et al.  Mixed-membership models of scientific publications , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[41]  Huan Liu,et al.  Discovering Overlapping Groups in Social Media , 2010, 2010 IEEE International Conference on Data Mining.

[42]  Hiroshi Motoda,et al.  Computational Methods of Feature Selection , 2022 .