Paper information - Estimating collection size with logistic regression

Collection size is an important feature of a collection's content summary and plays a vital role in collection selection for distributed search. In uncooperative environments, collection size estimation algorithms are used to estimate the sizes of collections through their search interfaces. This paper proposes the heterogeneous capture (HC) algorithm, in which the capture probabilities of documents are modeled with logistic regression. Given these heterogeneous capture probabilities, the HC algorithm estimates collection size by conditional maximum likelihood. Experimental results on real web data show that the HC algorithm outperforms both the multiple capture-recapture and the capture history algorithms.
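
The abstract gives no formulas, so the following is only a minimal sketch of the general idea, in the style of the conditional-likelihood estimators of Huggins [1] and Alho [4], and not the paper's actual HC implementation. It assumes each sampled document has a single, roughly standardized covariate x (hypothetical; for example a length or rank feature), that the per-sample capture probability is sigmoid(a + b*x) with coefficients shared across all T samples, and that the coefficients are fitted by plain gradient ascent. All function names are illustrative.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def capture_prob(x, a, b):
    # Assumed model: per-sample capture probability of a document with
    # covariate x, clamped away from 0 and 1 to keep divisions safe.
    return min(max(sigmoid(a + b * x), 1e-9), 1.0 - 1e-9)

def seen_prob(p, T):
    # Probability that a document is captured at least once in T samples.
    return 1.0 - (1.0 - p) ** T

def fit_conditional_mle(xs, ks, T, steps=2000, lr=0.05):
    # Maximize the likelihood of the capture counts ks, conditional on each
    # document having been captured at least once (only those are observed).
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, k in zip(xs, ks):
            p = capture_prob(x, a, b)
            # Score: observed captures minus expected captures given >= 1.
            r = k - T * p / seen_prob(p, T)
            ga += r
            gb += r * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b

def estimate_size(xs, T, a, b):
    # Horvitz-Thompson style estimate: each observed document stands for
    # 1 / P(seen at least once) documents in the collection.
    return sum(1.0 / seen_prob(capture_prob(x, a, b), T) for x in xs)

# Toy data (made up): covariates and capture counts of 6 observed documents.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
ks = [1, 1, 2, 1, 3, 3]
a, b = fit_conditional_mle(xs, ks, T=3)
print(estimate_size(xs, 3, a, b))
```

Documents that are hard to capture (low probability of being seen at least once) receive a larger weight, which is how heterogeneous capture probabilities enter the size estimate; with equal probabilities for all documents the estimate reduces to the observed count divided by a single inclusion probability.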

Sheng Wu | Xing Li | Jingfang Xu

[1] R. Huggins. On the statistical analysis of capture experiments, 1989.

[2] Ziv Bar-Yossef, et al. Random sampling from a search engine's index, 2006, WWW '06.

[3] King-Lup Liu, et al. Discovering the representative of a search engine, 2001, CIKM '01.

[4] J. Alho. Logistic regression in capture-recapture models, 1990, Biometrics.

[5] Milad Shokouhi, et al. Capturing collection size for distributed non-cooperative retrieval, 2006, SIGIR.