Feature Selection with Linked Data in Social Media

Feature selection is widely used in preparing high-dimensional data for effective data mining. Increasingly popular social media data presents new challenges to feature selection. Social media data consists of (1) traditional high-dimensional, attribute-value data such as posts, tweets, comments, and images, and (2) linked data that describes the relationships between social media users as well as which user authored which post. Social media data is also massive, noisy, and incomplete, which exacerbates the already challenging problem of feature selection. In this paper, we illustrate the differences between attribute-value data and social media data, investigate whether linked data can be exploited in a new feature selection framework by taking advantage of social science theories, extensively evaluate the effects of user-user and user-post relationships manifested in linked data on feature selection, and discuss some research issues for future work.
