Feature Selection for Social Media Data

Feature selection is widely used in preparing high-dimensional data for effective data mining. The explosive popularity of social media produces massive and high-dimensional data at an unprecedented rate, presenting new challenges to feature selection. Social media data consists of (1) traditional high-dimensional, attribute-value data such as posts, tweets, comments, and images, and (2) linked data that provides social context for posts and describes the relationships between social media users as well as who generates the posts, and so on. The nature of social media also determines that its data is massive, noisy, and incomplete, which exacerbates the already challenging problem of feature selection. In this article, we study a novel feature selection problem of selecting features for social media data with its social context. In detail, we illustrate the differences between attribute-value data and social media data, investigate if linked data can be exploited in a new feature selection framework by taking advantage of social science theories. We design and conduct experiments on datasets from real-world social media Web sites, and the empirical results demonstrate that the proposed framework can significantly improve the performance of feature selection. Further experiments are conducted to evaluate the effects of user--user and user--post relationships manifested in linked data on feature selection, and research issues for future work will be discussed.

[1]  Huan Liu,et al.  Multi-Source Feature Selection via Geometry-Dependent Covariance Analysis , 2008, FSDM.

[2]  Huan Liu,et al.  Discovering Overlapping Groups in Social Media , 2010, 2010 IEEE International Conference on Data Mining.

[3]  Gavin C. Cawley,et al.  Gene Selection in Cancer Classification using Sparse Logistic Regression with Bayesian Regularisation , 2006 .

[4]  Hiroshi Motoda,et al.  Computational Methods of Feature Selection , 2022 .

[5]  Chris H. Q. Ding,et al.  R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization , 2006, ICML.

[6]  Jennifer Neville,et al.  Linkage and Autocorrelation Cause Feature Selection Bias in Relational Learning , 2002, ICML.

[7]  Toon Calders,et al.  Non-derivable itemset mining , 2007, Data Mining and Knowledge Discovery.

[8]  Huan Liu,et al.  Spectral feature selection for supervised and unsupervised learning , 2007, ICML '07.

[9]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[10]  Jiawei Han,et al.  ACM Transactions on Knowledge Discovery from Data: Introduction , 2007 .

[11]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[12]  Huan Liu,et al.  mTrust: discerning multi-faceted trust in a connected world , 2012, WSDM '12.

[13]  Pauli Miettinen,et al.  The Discrete Basis Problem , 2008, IEEE Trans. Knowl. Data Eng..

[14]  Huan Liu,et al.  Integrating Social Media Data for Community Detection , 2011, MSM/MUSE.

[15]  David G. Stork,et al.  Pattern Classification , 1973 .

[16]  Gene H. Golub,et al.  Tikhonov Regularization and Total Least Squares , 1999, SIAM J. Matrix Anal. Appl..

[17]  Jennifer Neville,et al.  Modeling relationship strength in online social networks , 2010, WWW '10.

[18]  Massimiliano Pontil,et al.  Multi-Task Feature Learning , 2006, NIPS.

[19]  Jennifer Neville,et al.  Using Transactional Information to Predict Link Strength in Online Social Networks , 2009, ICWSM.

[20]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Steven A. Morris,et al.  Manifestation of emerging specialties in journal literature: A growth model of papers, references, exemplars, bibliographic coupling, cocitation, and clustering coefficient distribution: Research Articles , 2005 .

[22]  Kam-Fai Wong,et al.  Interpreting TF-IDF term weights as making relevance decisions , 2008, TOIS.

[23]  Aixia Guo,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2014 .

[24]  Lei Wang,et al.  Efficient Spectral Feature Selection with Minimum Redundancy , 2010, AAAI.

[25]  David G. Stork,et al.  Pattern Classification (2nd ed.) , 1999 .

[26]  M. McPherson,et al.  Birds of a Feather: Homophily in Social Networks , 2001 .

[27]  Jieping Ye,et al.  Multi-Task Feature Learning Via Efficient l2, 1-Norm Minimization , 2009, UAI.

[28]  Deng Cai,et al.  Laplacian Score for Feature Selection , 2005, NIPS.

[29]  Huan Liu,et al.  Scalable learning of collective behavior based on sparse social dimensions , 2009, CIKM.

[30]  Marko Robnik-Sikonja,et al.  Theoretical and Empirical Analysis of ReliefF and RReliefF , 2003, Machine Learning.

[31]  Foster J. Provost,et al.  Classification in Networked Data: a Toolkit and a Univariate Case Study , 2007, J. Mach. Learn. Res..

[32]  Noah E. Friedkin,et al.  Network Studies of Social Influence , 1993 .

[33]  Gavin C. Cawley,et al.  Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation , 2006, NIPS.

[34]  P. Abbeel,et al.  Label and Link Prediction in Relational Data , 2003 .

[35]  Steven A. Morris,et al.  Manifestation of emerging specialties in journal literature: A growth model of papers, references, exemplars, bibliographic coupling, cocitation, and clustering coefficient distribution , 2005, J. Assoc. Inf. Sci. Technol..

[36]  Aristidis Likas,et al.  Bayesian feature and model selection for Gaussian mixture models , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Shigeo Abe DrEng Pattern Classification , 2001, Springer London.

[38]  Feiping Nie,et al.  Efficient and Robust Feature Selection via Joint ℓ2, 1-Norms Minimization , 2010, NIPS.

[39]  Volker Roth,et al.  Feature Selection in Clustering Problems , 2003, NIPS.

[40]  Carla E. Brodley,et al.  Feature Selection for Unsupervised Learning , 2004, J. Mach. Learn. Res..

[41]  Yaser S. Abu-Mostafa,et al.  Hints , 2018, Neural Computation.

[42]  Huan Liu,et al.  Exploring Social-Historical Ties on Location-Based Social Networks , 2012, ICWSM.

[43]  Jimeng Sun,et al.  MetaFac: community discovery via relational hypergraph factorization , 2009, KDD.