Probabilistic Document Modelling

In this thesis the development and application of probabilistic models of documents is considered. The initial focus is on language models, which provide a way of modelling plain text documents. In particular, the hierarchical Dirichlet language model, which is derived from simple Bayesian theory, is investigated and shown to be well approximated by an existing method known as generalised PPM-A. Using this equivalence, generalised PPM-A is extended to produce a language model which, while operating at the level of individual letter-like symbols, is able to exploit the division of the text stream into words. The new model, used in conjunction with a word list, is shown to improve performance when very little data is available from which to learn the statistics of the language. The hierarchical Dirichlet model is then applied to the task of information retrieval, yielding a new retrieval method which naturally incorporates document frequency information. Such information has traditionally been used in retrieval systems, but had previously either been missing from, or introduced heuristically into, language-model-based approaches to the problem. The hierarchical approach is also extended to retrieval at the passage level, where it gives promising results. Finally, the scope of the investigation is broadened to documents which contain diagrams as well as plain text. A method is developed to group fragments of digitised ink strokes into perceptually relevant components of a diagram while simultaneously labelling each component with an object class. The approach, which is based on the conditional random field, is shown to work well both in terms of grouping and in improving labelling performance relative to other methods.
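For concreteness, the predictive distribution of the hierarchical Dirichlet language model mentioned above can be sketched in the standard smoothed-count form (the notation here, including the symbols c_h(w), m_w and \alpha, is chosen for illustration rather than taken from the thesis):

\[
  P(w \mid h) \;=\; \frac{c_h(w) + \alpha\, m_w}{\sum_{w'} c_h(w') + \alpha},
\]

where c_h(w) is the number of times symbol w has followed context h, m is a lower-order (parent) distribution over symbols, and \alpha is a concentration parameter controlling how strongly predictions are pulled towards the parent. Applying the same construction hierarchically, with per-document distributions tied to a collection-level parent, is one way to see how document frequency information can arise naturally in the retrieval setting described above.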
