Term Filtering with Bounded Error

In this paper, we consider a novel problem referred to as term filtering with bounded error to reduce the term (feature) space by eliminating terms without (or with bounded) information loss. Different from existing works, the obtained term space provides a complete view of the original term space. More interestingly, several important questions can be answered such as: 1) how different terms interact with each other and 2) how the filtered terms can be represented by the other terms. We perform a theoretical investigation of the term filtering problem and link it to the Geometric Covering By Discs problem, and prove its NP-hardness. We present two novel approaches for both loss less and lossy term filtering with bounds on the introduced error. Experimental results on multiple text mining tasks validate the effectiveness of the proposed approaches.

[1]  Andreas S. Weigend,et al.  A neural network approach to topic spotting , 1995 .

[2]  David S. Johnson The NP-Completeness Column: An Ongoing Guide , 1986, J. Algorithms.

[3]  Ruoming Jin,et al.  A Topic Modeling Approach and Its Integration into the Random Walk Framework for Academic Search , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[4]  Nisheeth Shrivastava,et al.  Graph summarization with bounded error , 2008, SIGMOD Conference.

[5]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[6]  Evgeniy Gabrilovich,et al.  Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5 , 2004, ICML.

[7]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[8]  Yiming Yang,et al.  Using Corpus Statistics to Remove Redundant Words in Text Categorization , 1996, J. Am. Soc. Inf. Sci..

[9]  Yiming Yang,et al.  Noise reduction in a statistical approach to text categorization , 1995, SIGIR '95.

[10]  Mohammad Al Hasan,et al.  Clustering with Lower Bound on Similarity , 2009, PAKDD.

[11]  Jie Tang,et al.  ArnetMiner: extraction and mining of academic social networks , 2008, KDD.

[12]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[13]  R. Tibshirani,et al.  Sparse Principal Component Analysis , 2006 .

[14]  Maosong Sun,et al.  Scalable Term Selection for Text Categorization , 2007, EMNLP.

[15]  C. J. van Rijsbergen,et al.  Investigating the relationship between language model perplexity and IR precision-recall measures , 2003, SIGIR.

[16]  Piotr Indyk,et al.  Scalable Techniques for Clustering the Web , 2000, WebDB.

[17]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[18]  David D. Lewis,et al.  A comparison of two learning algorithms for text categorization , 1994 .

[19]  George Forman,et al.  An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[20]  Heikki Mannila,et al.  Random projection in dimensionality reduction: applications to image and text data , 2001, KDD '01.

[21]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[22]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[23]  Naftali Tishby,et al.  Most informative dimension reduction , 2002, AAAI/IAAI.

[24]  Nicole Immorlica,et al.  Locality-sensitive hashing scheme based on p-stable distributions , 2004, SCG '04.

[25]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[26]  Anirban Dasgupta,et al.  Feature selection methods for text classification , 2007, KDD '07.

[27]  Patrick Pantel,et al.  Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering , 2005, ACL.

[28]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[29]  Max Welling,et al.  Distributed Inference for Latent Dirichlet Allocation , 2007, NIPS.

[30]  Yin Yang,et al.  Query by document , 2009, WSDM '09.

[31]  Bing Liu,et al.  Web Page Cleaning for Web Mining through Feature Weighting , 2003, IJCAI.

[32]  Alexandre d'Aspremont,et al.  Optimal Solutions for Sparse Principal Component Analysis , 2007, J. Mach. Learn. Res..

[33]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[34]  Sam T. Roweis,et al.  EM Algorithms for PCA and SPCA , 1997, NIPS.

[35]  John D. Lafferty,et al.  A study of smoothing methods for language models applied to Ad Hoc information retrieval , 2001, SIGIR '01.

[36]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.