Unsupervised Streaming Feature Selection in Social Media

The explosive growth of social media sites has produced massive amounts of high-dimensional data. Feature selection is an effective way to prepare such data for analysis, but the characteristics of social media present novel challenges. First, social media data is not fully structured, and its features are usually not predefined but generated dynamically. For example, on Twitter, slang words (features) are created every day and can become popular within a short period of time. Traditional batch-mode feature selection methods are hard to apply directly to such features. Second, given the nature of social media, label information is costly to collect, which exacerbates the problem because feature relevance must be assessed without labels. On the other hand, additional data sources present opportunities; for example, link information is ubiquitous in social media and could help in selecting relevant features. In this paper, we study a novel problem: unsupervised streaming feature selection for social media data. We investigate how to exploit link information in streaming feature selection, resulting in a novel unsupervised streaming feature selection framework, USFS. Experimental results on two real-world social media datasets show the effectiveness and efficiency of the proposed framework compared with state-of-the-art unsupervised feature selection algorithms.
