Unsupervised Feature Selection for Text Data

Feature selection for unsupervised tasks is particularly challenging, especially when dealing with text data. The growth in online documents and email communication creates a need for tools that can operate without user supervision. In this paper we examine novel feature selection techniques that address this need. A distributional similarity measure from information theory is applied to measure feature utility. This utility informs the search for both representative and diverse features in two complementary ways: Cluster partitions the entire feature space and then selects one feature to represent each cluster, while Greedy grows the feature subset one greedily selected feature at a time. In particular, we found that Greedy's local search is suited to learning smaller feature subsets, while Cluster is able to improve the global quality of larger feature sets. Experiments with four email data sets show significant improvement in retrieval accuracy with nearest-neighbour-based search methods compared to an existing frequency-based method. Importantly, both Greedy and Cluster make significant progress towards the upper-bound performance set by a standard supervised feature selection method.
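
To make the idea of greedily growing a diverse feature subset concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that each feature (word) is described by its counts over documents, that Jensen-Shannon divergence stands in for the distributional similarity measure, and that diversity is scored by a farthest-first rule. The names js_divergence, greedy_select and example_counts are hypothetical, and the toy counts are invented for illustration only.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def greedy_select(feature_counts, k):
    """Select up to k features, one at a time.

    Assumption: the seed is the most frequent feature, and each later
    step adds the candidate whose smallest divergence to the features
    already chosen is largest (farthest-first selection).
    """
    dists = {f: normalise(c) for f, c in feature_counts.items()}
    seed = max(feature_counts, key=lambda f: sum(feature_counts[f]))
    selected = [seed]
    while len(selected) < min(k, len(dists)):
        candidates = [f for f in dists if f not in selected]
        best = max(
            candidates,
            key=lambda f: min(js_divergence(dists[f], dists[s]) for s in selected),
        )
        selected.append(best)
    return selected


# Toy word-by-document counts (invented): two scheduling words, two billing words.
example_counts = {
    "meeting": [6, 4, 0, 0],
    "agenda": [4, 5, 0, 0],
    "invoice": [0, 0, 6, 6],
    "payment": [0, 0, 5, 6],
}

print(greedy_select(example_counts, 2))
```

With these toy counts the first pick is the most frequent word, "invoice", and the second pick is one of the scheduling words, because "payment" is distributed over almost the same documents as "invoice" and so adds little diversity. A Cluster-style variant would instead group all features by the same divergence first and then keep one representative per group.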
