Iterative Denoising using Jensen-Renyi Divergences with an Application to Unsupervised Document Categorization

Iterative denoising trees were used by Karakos et al. (2005) for unsupervised hierarchical clustering. Tree construction involves projecting the data onto low-dimensional spaces, as a means of smoothing their empirical distributions, and splitting each node according to an information-theoretic maximization objective. In this paper, we improve upon the work of Karakos et al. (2005) in two ways: (i) the amount of computation spent searching for a good projection at each node now adapts to the intrinsic dimensionality of the data observed at that node; (ii) the objective at each node is to find a split that maximizes a generalized form of mutual information, the Jensen-Renyi divergence; this is followed by an iterative Naive Bayes classification. The single parameter α of the Jensen-Renyi divergence is chosen using the "strapping" methodology, which learns a meta-classifier on a related task. Compared with the sequential information bottleneck method, our procedure produces state-of-the-art results on the task of unsupervised categorization of documents from the "20 Newsgroups" dataset.
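For readers unfamiliar with the split objective, the following is a minimal sketch of the Jensen-Renyi divergence in its standard form: the Renyi entropy of order α of a weighted mixture of distributions, minus the weighted average of their individual Renyi entropies. As α approaches 1 it reduces to the Jensen-Shannon divergence, which is why it can be read as a generalized mutual information between the cluster label and the observed feature. The function names, the use of NumPy, and the toy distributions are assumptions made for illustration; this is not the authors' implementation, and it does not cover the projection search or the strapping procedure.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (in nats) of a probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing
    if np.isclose(alpha, 1.0):
        # Limit alpha -> 1 is the Shannon entropy.
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def jensen_renyi(dists, weights, alpha):
    """Jensen-Renyi divergence of the rows of `dists` under mixture `weights`.

    dists:   array of shape (k, n), each row a probability vector
    weights: array of shape (k,), non-negative and summing to 1
    """
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists  # weighted average distribution
    mean_entropy = sum(w * renyi_entropy(p, alpha)
                       for w, p in zip(weights, dists))
    return renyi_entropy(mixture, alpha) - mean_entropy

# Hypothetical word distributions for the two children of a candidate split.
left = [0.7, 0.2, 0.1]
right = [0.1, 0.2, 0.7]
score = jensen_renyi([left, right], [0.5, 0.5], alpha=0.5)
```

In a tree-building setting, a score like this would be evaluated for each candidate split, with the weights set to the fraction of data falling in each child, and the split with the largest value kept. For α in (0, 1) the Renyi entropy is concave, so the score is non-negative and equals zero when the children have identical distributions.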

[1] Philip A. Chou, et al. Optimal Partitioning for Classification and Regression Trees, 1991, IEEE Trans. Pattern Anal. Mach. Intell.

[2] David J. Marchette, et al. Integrated sensing and processing decision trees, 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3] Imre Csiszár. Generalized cutoff rates and Renyi's information measures, 1995, IEEE Trans. Inf. Theory.

[4] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[5] Hermann Ney, et al. Algorithms for bigram and trigram word clustering, 1995, Speech Commun.

[6] Thorsten Joachims, et al. Optimizing search engines using clickthrough data, 2002, KDD.

[7] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[8] Adrian E. Raftery, et al. MCLUST: Software for Model-Based Cluster Analysis, 1999.

[9] H. Krim, et al. Jensen-Renyi divergence measure: theoretical and computational perspectives, 2003, IEEE International Symposium on Information Theory, 2003. Proceedings.

[10] Naftali Tishby, et al. Unsupervised document classification using sequential information maximization, 2002, SIGIR '02.

[11] Carey E. Priebe, et al. Unsupervised classification via decision trees: an information-theoretic perspective, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[12] Ken Lang, et al. NewsWeeder: Learning to Filter Netnews, 1995, ICML.

[13] Alfred O. Hero, et al. Feature coincidence trees for registration of ultrasound breast images, 2001, Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205).

[14] Damianos Karakos, et al. Bootstrapping Without the Boot, 2005, HLT.

[15] N. S. Barnett, et al. Private communication, 1969.

[16] Philip J. Hayes, et al. Guest Editorial - Special Issue on Text Categorization, 1994, ACM Trans. Inf. Syst.

[17] Alessandro Giua, et al. Guest Editorial, 2001, Discrete event dynamic systems.

[18] Yun He, et al. A generalized divergence measure for robust image registration, 2003, IEEE Trans. Signal Process.

[19] Stephen E. Robertson, et al. Okapi at TREC-3, 1994, TREC.

[20] Stephen E. Robertson, et al. Gatford, Centre for Interactive Systems Research, Department of Information, 1996.