Learning Determinantal Point Processes

The increasing availability of both interesting data and processing capacity has led to widespread interest in machine learning techniques that deal with complex, structured output spaces in fields like image processing, computational biology, and natural language processing. By making multiple interrelated decisions at once, these methods can achieve far better performance than is possible when each decision is treated in isolation. However, accounting for the complexity of the output space also imposes a significant computational burden that must be balanced against the modeling advantages. Graphical models, for example, offer efficient approximations when considering only local, positive interactions. The popularity of graphical models attests to the fact that these restrictions can be a good fit in some cases, but there are also many other interesting tasks for which we need new models with new assumptions. In this thesis we show how determinantal point processes (DPPs) can be used as probabilistic models for binary structured problems characterized by global, negative interactions. Samples from a DPP correspond to subsets of a fixed ground set (for instance, the documents in a corpus or the possible locations of objects in an image), and their defining characteristic is a tendency to be diverse. Thus, DPPs can be used to choose diverse sets of high-quality search results, to build informative summaries by selecting diverse sentences from documents, or to model non-overlapping human poses in images or video. DPPs arise in quantum physics and random matrix theory from a number of interesting theoretical constructions, but we show how they can also be used to model real-world data; we develop new extensions, algorithms, and theoretical results that make modeling and learning with DPPs efficient and practical.
Throughout, we demonstrate experimentally that the techniques we introduce allow DPPs to be used for performing real-world tasks like document summarization, multiple human pose estimation, search diversification, and the threading of large document collections.
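
To make the notion of "global, negative interactions" concrete, the following is a minimal sketch rather than the thesis's own code. It assumes the standard L-ensemble form of a DPP, in which a subset Y of the ground set has probability det(L_Y) / det(L + I) for a symmetric positive semidefinite kernel L. The kernel values and the helper names (`det`, `submatrix`, `dpp_probabilities`) are hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination with partial pivoting.

    The determinant of the empty (0x0) matrix is 1 by convention.
    """
    a = [[float(x) for x in row] for row in m]
    n = len(a)
    result = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def submatrix(m, idx):
    """Principal submatrix of m restricted to the indices in idx."""
    return [[m[i][j] for j in idx] for i in idx]


def dpp_probabilities(L):
    """Probability of every subset under the L-ensemble P(Y) = det(L_Y) / det(L + I)."""
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    z = det(L_plus_I)  # normalizer: sum of det(L_Y) over all subsets Y
    return {subset: det(submatrix(L, subset)) / z
            for k in range(n + 1)
            for subset in combinations(range(n), k)}


# Hypothetical kernel: items 0 and 1 are near-duplicates (similarity 0.9),
# while item 2 is unrelated to both.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

probs = dpp_probabilities(L)
for subset, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(subset, round(p, 4))
```

Under this assumed kernel, the pair of near-duplicates `(0, 1)` receives much lower probability than the dissimilar pair `(0, 2)`, which is the diversity preference described above. Enumerating all subsets is only feasible for tiny ground sets and is used here purely to show that the probabilities normalize.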
