Bayesian text analytics for document collections

Modern document collections are too large to annotate and curate manually. As ever larger amounts of data become available, historians, librarians, and other scholars must increasingly rely on automated systems to analyze the contents of their collections efficiently and accurately and to find new and interesting patterns in them. Modern techniques in Bayesian text analytics are becoming widespread and have the potential to revolutionize the way research is conducted. Much work has been done in the document modeling community toward this end, though most of it focuses on modern, relatively clean text data. We present research on improved modeling of document collections that may contain textual noise or that may include real-valued metadata associated with the documents. This class of documents includes many historical document collections. Indeed, our specific motivation for this work is to improve the modeling of historical documents, which are often noisy, have historical context represented by metadata, or both. Many historical documents are digitized by applying Optical Character Recognition (OCR) to images of old and degraded originals. Historical documents also often include associated metadata, such as timestamps, which can be incorporated into an analysis of their topical content. Many techniques, such as topic models, have been developed to automatically discover patterns of meaning in large collections of text. While these methods are useful, they can break down in the presence of OCR errors. We show the extent to which this breakdown in performance occurs. The specific types of analysis covered in this dissertation are document clustering; feature selection; unsupervised and supervised topic modeling for documents with and without OCR errors; and a new supervised topic model that uses Bayesian nonparametrics to improve the modeling of document metadata.
We present results in each of these areas, with an emphasis on studying the effects of noise on the performance of the algorithms and on modeling the metadata associated with the documents. In this research we improve the state of the art in both document clustering and topic modeling, introduce a useful synthetic dataset for historical document researchers, and present analyses that show empirically how existing algorithms break down in the presence of OCR errors.
