Topic regression

Text documents are generally accompanied by non-textual information, such as authors, dates, publication sources, and, increasingly, automatically recognized named entities. Work in text analysis has often involved predicting these non-text values based on text data for tasks such as document classification and author identification. This thesis considers the opposite problem: predicting the textual content of documents based on non-text data. In this work I study several regression-based methods for estimating the influence of specific metadata elements in determining the content of text documents. Such topic regression methods allow users of document collections to test hypotheses about the underlying environments that produced those documents.
