The IRRA (IR-Ra) group participated in the 2010 Web track. This year, our main concern is to examine the effect of supplementary methods on the effectiveness of the new nonparametric index term weighting model, divergence from independence (DFI). Every written document contains words, but the words used in individual documents differ due to many latent factors, such as topic, author, and style. Some words are used intentionally by authors to compose the information content of documents, while others are used because of grammatical rules. The former are commonly referred to as keywords or content-bearing words, and the latter as function words or stop words. Since function words are used because of grammatical rules, they appear, more or less, in almost all documents, irrespective of (or independently of) the documents' information content. It is therefore reasonable to expect function words to be distributed in proportion to document lengths. Content-bearing words, on the other hand, are used intentionally by authors, so their frequency distributions over a collection must be affected by that intent and hence should differ from the frequency distributions of function words. The content-bearing words of a document can thus be identified by measuring divergence from independence. According to the DFI model, if the ratio of the frequencies of two different words remains constant over all documents, the occurrences of those words are said to be independent of the documents. Assume that the magnitude of a word's contribution to the information content of a particular document is proportional to its observed frequency in that document. Then both words can be said to contribute equally to the information content of every document.
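The constant-ratio criterion can be sketched in a few lines of code. The function name and the toy frequency vectors below are hypothetical illustrations, not part of the published DFI formulation; cross-multiplication is used so that zero frequencies do not cause division errors.

```python
def ratio_is_constant(freqs_a, freqs_b):
    """Return True when the ratio freqs_a[i] / freqs_b[i] is the same
    for every document i, i.e., the two words occur independently of
    the documents in the DFI sense. Uses cross-multiplication
    (a * b0 == b * a0) to avoid dividing by zero frequencies."""
    pairs = list(zip(freqs_a, freqs_b))
    a0, b0 = pairs[0]
    return all(a * b0 == b * a0 for a, b in pairs[1:])

# Two function-word-like terms with proportional counts: independent.
print(ratio_is_constant([1, 2, 3], [2, 4, 6]))   # True

# A bursty content word against a function word: not independent.
print(ratio_is_constant([5, 0, 1], [3, 4, 3]))   # False
```

A word whose frequency ratio against a known function word stays constant across the collection behaves like a function word itself; a word that breaks the ratio in some documents is a candidate content-bearing word for those documents.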
Notice, however, that an equal contribution to the information content of all documents actually implies no contribution. Such words can only be words used for a particular reason or rule, such as grammar; otherwise, a word could not appear in all documents, which have different information contents. By analogy, HTML tags in Web pages are a good basis for exemplifying the independence notion. Since function words can appear in all documents not because of any contribution to the documents' information content but because of grammatical rules, they can be thought of as HTML tags. For instance, every Web page contains exactly two “html” tags and two “body” tags, so the ratio of the frequencies of the “html” and “body” tags remains constant over all Web pages. According to the independence model, this suggests that the occurrence of the “html” tag relative to the “body” tag does not depend on the Web pages, and that the “html” and “body” tags contribute equally to the information content of each Web page. It is already known that HTML tags are used by design, independently of the information content of Web pages. The point here is that, by using the independence model, this property of HTML tags can be related to their observed frequency distributions over Web pages, and can thereby be recovered without any external knowledge. This definition of independence is easy to understand but hard to use in practice. To use it in practice, it is necessary to measure the degree of independence/dependence between each word and each document individually. In fact, for each word–document pair, the independence model suggests the frequency expected under independence. This enables us to decide whether a particular word is independent of a given document.
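The per-pair expected frequency can be sketched as follows. Under independence, a term's frequency in a document is proportional to the document's length, so the expectation takes the familiar contingency-table form: collection frequency of the term times document length, divided by the total number of tokens. The toy documents and function names are hypothetical; the actual DFI weighting applied on top of this expectation is not reproduced here.

```python
from collections import Counter

# Toy collection: each document is a list of tokens (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat in the garden".split(),
    "the the the a a an".split(),
]

doc_lens = [len(d) for d in docs]
total_tokens = sum(doc_lens)
coll_freq = Counter(tok for d in docs for tok in d)

def expected_freq(term, doc_index):
    """Frequency of `term` in document `doc_index` expected under
    independence: proportional to the document's share of the
    collection's tokens."""
    return coll_freq[term] * doc_lens[doc_index] / total_tokens

def divergence(term, doc_index):
    """Observed minus expected frequency. A large positive value
    suggests the term is content-bearing for this document; a value
    near zero suggests function-word-like behaviour."""
    observed = docs[doc_index].count(term)
    return observed - expected_freq(term, doc_index)

print(divergence("cat", 0))  # positive: "cat" is over-represented
print(divergence("the", 0))  # negative: "the" occurs below expectation
```

Comparing the observed frequency against this expectation is what turns the abstract independence definition into a per-word, per-document decision.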