Using two-stage conditional word frequency models to model word burstiness and motivating TF-IDF

Several authors have recently studied the problem of creating exchangeable models for natural languages that exhibit word burstiness. Word burstiness means that a word that has appeared once in a text should be more likely to appear again than it was to appear in the first place. In this article, the existing methods are compared theoretically through a unifying framework. Within this framework we introduce new models that do not satisfy the exchangeability assumption but whose probability revisions depend only on the counts of the words that have previously appeared. We will refer to these models as two-stage conditional presence/abundance models since they, just like some recently introduced models for the abundance of rare species in ecology, separate the issue of presence from the issue of abundance when present. We will see that the widely used TF-IDF heuristic for information retrieval follows naturally from these models by calculating a cross-entropy. We will also discuss a connection between TF-IDF and file formats that separate presence from abundance given presence.
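
As a rough illustration of the burstiness property defined above, the following sketch computes next-word probabilities under a simple Polya urn scheme, a standard exchangeable model whose probability revisions depend only on word counts. It is not the two-stage presence/abundance model of this article; the function name, the vocabulary, the base weights and the reinforcement parameter are all assumptions chosen for the example.

```python
from fractions import Fraction


def polya_next_word_prob(word, history, base_weights, reinforcement=1):
    """Probability that the next token is `word` under a simple Polya urn.

    Each word starts with base_weights[word] balls in the urn, and every
    earlier occurrence of a word adds `reinforcement` extra balls for it.
    All names and parameters here are hypothetical, for illustration only.
    """
    # Total weight in the urn after observing the history.
    total = sum(base_weights.values()) + reinforcement * len(history)
    # Weight of `word`: its base weight plus reinforcement per past occurrence.
    weight = base_weights[word] + reinforcement * history.count(word)
    return Fraction(weight, total)


# Toy vocabulary with made-up base weights (assumption, not data).
base = {"the": 50, "model": 9, "noriega": 1}

# Probability of a rare word before and after it has appeared once.
p_first = polya_next_word_prob("noriega", [], base)
p_again = polya_next_word_prob("noriega", ["noriega"], base)

print(p_first, p_again)  # 1/60 2/61
```

In this toy setting a single occurrence roughly doubles the probability of the rare word, which is the qualitative behaviour that burstiness describes; the result depends only on how often each word has appeared, not on the order of the history.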
