Unsupervised Feature Selection for Multi-View Clustering on Text-Image Web News Data

Unlabeled high-dimensional text-image web news data are produced every day, presenting new challenges for unsupervised feature selection on multi-view data. State-of-the-art multi-view unsupervised feature selection methods learn pseudo class labels by spectral analysis, which is sensitive to the choice of similarity metric for each view. For text-image data, the raw text itself contains more discriminative information than a similarity graph, which loses information during construction. The text features can therefore be used directly for label learning, avoiding the information loss incurred by spectral analysis. We propose a new multi-view unsupervised feature selection method in which orthogonal nonnegative matrix factorization, regularized by local learning on the image view, is used to learn pseudo labels while robust joint $l_{2,1}$-norm minimization is simultaneously performed to select discriminative features. The method encourages as much cross-view consensus on the pseudo labels as possible. We systematically evaluate the proposed method on multi-view text-image web news datasets crawled from two major US media channels, CNN and FOXNews. Our extensive experiments demonstrate the efficacy of the new method over state-of-the-art multi-view and single-view unsupervised feature selection methods.
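
For readers unfamiliar with the $l_{2,1}$-norm mentioned above, its standard definition is recalled here. The symbols $W$, $d$, and $c$ are illustrative assumptions (a feature weight matrix with one row per feature and one column per pseudo class) and are not taken from the method's actual objective, which this abstract does not state.

$$
\|W\|_{2,1} \;=\; \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} W_{ij}^{2}} \;=\; \sum_{i=1}^{d} \|w^{i}\|_{2}, \qquad W \in \mathbb{R}^{d \times c},
$$

where $w^{i}$ denotes the $i$-th row of $W$. Because the norm sums the Euclidean lengths of the rows, minimizing it tends to drive entire rows toward zero. In $l_{2,1}$-based feature selection generally, features are then ranked by their row norms $\|w^{i}\|_{2}$, and those with the largest norms are retained.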