Document visual similarity analysis and automated publishing

Managing large document databases has become an important task. The ability to automatically compare document layouts, and to classify and search documents by their visual appearance, is desirable in many applications. We propose a new algorithm that computes a similarity function between documents based solely on their visual appearance, without taking their content into consideration. A user may wish to search a database for documents that are similar to a query in terms of their stylistic features, or may want to browse the whole database. In these tasks, clustering similar documents and organizing the document database with respect to the clusters is preferable to presenting documents in a random order. In the first part of the thesis, we present a document visual similarity measure and propose organizing single-page documents in a 3-D hierarchical structure called a similarity pyramid. The pyramid is constructed from a stack of embeddings of the document database on a 2-D surface, obtained with a nonlinear dimensionality reduction algorithm called Isomap. Higher levels of the pyramid consist of document image icons that each represent a large group of roughly similar documents, whereas lower levels contain document image icons representing small groups of very similar documents. A user can browse the database by moving within a given level of the pyramid or by moving between levels. In the second part of the thesis, we address the problem of automated document layout composition. We present a new paradigm for automated document composition based on a hierarchical probabilistic document model (HPDM). Newspaper-style documents allow multiple articles to be allocated on the same page. The model formally incorporates key design variables such as content pagination, relative arrangement possibilities for page elements, and possible page edits.
Aesthetic parameters are modeled probabilistically from data provided by graphic designers. The design choices are modeled jointly as coupled random variables (a Bayesian network), with uncertainty captured by their probability distributions. The overall joint probability distribution for the network assigns higher probability to good design choices. Given this model, we show that the general document layout problem can be reduced to probabilistic inference over the Bayesian network.
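
To make the reduction of layout to probabilistic inference concrete, the following is a minimal sketch and not the HPDM itself. It assumes a toy Bayesian network with three hypothetical design variables (a page template, the number of paragraphs placed on the page, and an image scaling edit) and finds the most probable joint assignment by exhaustive enumeration. All variable names and probability values are invented for illustration; in the thesis the distributions are derived from designer-provided data.

```python
from itertools import product

# Hypothetical design variables (illustrative only, not taken from the thesis).
TEMPLATES = ["one_column", "two_column"]   # relative arrangement of page elements
PARAGRAPHS = [3, 4, 5]                     # pagination: paragraphs placed on the page
IMAGE_SCALES = [0.8, 1.0]                  # page edit: image scaling factor

# P(template): made-up prior over arrangements.
P_TEMPLATE = {"one_column": 0.4, "two_column": 0.6}

# P(paragraphs | template): made-up conditional table.
P_PARAGRAPHS = {
    "one_column": {3: 0.5, 4: 0.4, 5: 0.1},
    "two_column": {3: 0.2, 4: 0.3, 5: 0.5},
}


def p_scale(scale, template, paragraphs):
    """P(scale | template, paragraphs): crowded pages favor the smaller image."""
    crowded_at = 4 if template == "one_column" else 5
    p_small = 0.7 if paragraphs >= crowded_at else 0.2
    return p_small if scale == 0.8 else 1.0 - p_small


def joint(template, paragraphs, scale):
    """Joint probability as the product of the network's factors."""
    return (
        P_TEMPLATE[template]
        * P_PARAGRAPHS[template][paragraphs]
        * p_scale(scale, template, paragraphs)
    )


def map_layout():
    """Return the most probable (template, paragraphs, scale) assignment."""
    candidates = product(TEMPLATES, PARAGRAPHS, IMAGE_SCALES)
    return max(candidates, key=lambda choice: joint(*choice))


if __name__ == "__main__":
    best = map_layout()
    print(best, joint(*best))
```

Exhaustive enumeration is only workable for a network this small. A realistic model with many pages and design variables would need an inference algorithm that exploits the network structure, which this sketch does not attempt.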