We introduce a nonparametric approach to multiscale analysis of document corpora using a hierarchical matrix analysis framework called diffusion wavelets. In contrast to eigenvector methods, diffusion wavelets construct multiscale basis functions. In this framework, a hierarchy is constructed automatically by an iterative series of dilation and orthogonalization steps, beginning with an initial set of orthogonal basis functions such as the unit-vector basis. The basis functions at each level are constructed from those at the level below by dilation using dyadic powers of a diffusion operator. A novel aspect of our work is that the diffusion analysis is conducted on the space of variables (words) rather than instances (documents). The approach automatically and efficiently determines the number of levels in the topical hierarchy, as well as the topics at each level. Multiscale analysis of a document corpus is achieved by projecting the documents onto the spaces spanned by the basis functions at different levels. Further, when the input term-term matrix is a "local" diffusion operator, the algorithm runs in time approximately linear in the number of non-zero elements of the matrix. The approach is illustrated on several data sets, including NIPS conference papers, 20 Newsgroups, and TDT2.
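The dilation and orthogonalization loop described above can be illustrated with a minimal dense sketch. It rests on several assumptions that are not from the original: the function name `diffusion_scaling_bases`, the parameters `eps` and `max_levels`, and the toy term-term matrix are all hypothetical, and a truncated SVD stands in for the sparse QR-style orthogonalization normally used by diffusion wavelets. Because it is dense, it does not show the near-linear running time mentioned in the abstract.

```python
import numpy as np

def diffusion_scaling_bases(T, eps=1e-6, max_levels=10):
    """Sketch: build one orthonormal basis per level from a symmetric diffusion operator T.

    Each returned basis has its columns expressed in the original word coordinates.
    """
    n = T.shape[0]
    basis = np.eye(n)   # level 0: unit-vector basis
    op = T.copy()       # T^(2^j) represented in the current basis
    bases = [basis]
    for _ in range(max_levels):
        # Dilation: the columns of `op` are the current basis functions after diffusion.
        # Orthogonalization (assumption: SVD instead of sparse QR): keep only the
        # directions whose singular value exceeds the precision eps.
        U, s, _ = np.linalg.svd(op)
        k = int(np.sum(s > eps))
        if k == 0:
            break
        Q = U[:, :k]                # new basis in current-basis coordinates
        basis = basis @ Q           # new basis in original word coordinates
        op = Q.T @ op @ op @ Q      # next dyadic power, compressed onto the new basis
        bases.append(basis)
        if k == 1:
            break
    return bases

# Toy example (hypothetical data): 20 documents over 8 words.
rng = np.random.default_rng(0)
X = rng.random((20, 8))                 # document-term matrix
W = X.T @ X                             # term-term co-occurrence
d = W.sum(axis=1)
T = W / np.sqrt(np.outer(d, d))         # symmetric normalization

bases = diffusion_scaling_bases(T)
# Multiscale document representations: project documents onto each level's basis.
projections = [X @ B for B in bases]
```

The number of levels is simply `len(bases)`, which the loop determines from `eps` rather than from a user-specified hierarchy depth; coarser levels have fewer columns because repeated squaring of the operator suppresses directions with small eigenvalues.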