A Hierarchical Model of Web Summaries

We investigate the relevance of hierarchical topic models to represent the content of Web gists. We focus our attention on DMOZ, a popular Web directory, and propose two algorithms to infer such a model from its manually-curated hierarchy of categories. Our first approach, based on information-theoretic grounds, uses an algorithm similar to recursive feature selection. Our second approach is fully Bayesian and derived from the more general model, hierarchical LDA. We evaluate the performance of both models against a flat 1-gram baseline and show improvements in terms of perplexity over held-out data.

[1]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[2]  Marius Pasca,et al.  Latent Variable Models of Concept-Attribute Attachment , 2009, ACL/IJCNLP.

[3]  Bernadette Bouchon-Meunier,et al.  Enhanced web document summarization using hyperlinks , 2003, HYPERTEXT '03.

[4]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[5]  Hector Garcia-Molina,et al.  Clustering the tagged web , 2009, WSDM '09.

[6]  Thomas Hofmann,et al.  The Cluster-Abstraction Model: Unsupervised Learning of Topic Hierarchies from Text Data , 1999, IJCAI.

[7]  Qiang Yang,et al.  Web-page summarization using clickthrough data , 2005, SIGIR '05.

[8]  Thomas L. Griffiths,et al.  The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies , 2007, JACM.

[9]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[10]  Wei Li,et al.  Nonparametric Bayes Pachinko Allocation , 2007, UAI.

[11]  Vibhu O. Mittal,et al.  OCELOT: a system for summarizing Web pages , 2000, SIGIR '00.

[12]  Hanna M. Wallach,et al.  Topic modeling: beyond bag-of-words , 2006, ICML.

[13]  Eduard H. Hovy,et al.  The Automated Acquisition of Topic Signatures for Text Summarization , 2000, COLING.

[14]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[15]  C. Elkan,et al.  Topic Models , 2008 .