Statistical models for unformatted text

In this note, we describe some of the outstanding problems concerning statistical information retrieval models and the underlying stochastic language production models they assume. The problems can be separated into classes according to the underlying language model, which can be either a sequence model or a grammar model. Both kinds of model are based on a stochastic process, but they apply a different filter to its realization: the grammar models use a stochastic context-sensitive grammar, and the sequence models use a high-order Markov chain. Most of these problems cannot be solved without experimentation with information retrieval concepts and systems. Most information retrieval systems that currently exist have had to make operational assumptions about the answers to these questions. It is expected that more precise knowledge of solutions to these problems will simplify the design and improve the effectiveness of statistical information retrieval systems.
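
To make the sequence-model idea concrete, the following is a minimal sketch of an order-k Markov chain over word tokens, where the next word depends only on the previous k words. It is an illustration under stated assumptions, not the model from the note itself: the function names `train_markov` and `generate`, the toy corpus, and the choice of order 2 are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_markov(tokens, order=2):
    """Count which token follows each context of `order` consecutive tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        counts[context][tokens[i + order]] += 1
    return counts


def generate(counts, seed, length, rng):
    """Sample up to `length` further tokens, starting from the `seed` context.

    `seed` is assumed to be a non-empty tuple whose length equals the model order.
    """
    out = list(seed)
    order = len(seed)
    for _ in range(length):
        followers = counts.get(tuple(out[-order:]))
        if not followers:
            break  # context never seen in training: stop early
        words = list(followers)
        weights = [followers[w] for w in words]
        # Draw the next token in proportion to its observed frequency.
        out.append(rng.choices(words, weights=weights)[0])
    return out


# Toy corpus (hypothetical); a real system would estimate counts from a large text collection.
tokens = "the cat sat on the mat and the cat ran".split()
model = train_markov(tokens, order=2)
sample = generate(model, ("the", "cat"), 5, random.Random(0))
```

Raising `order` lets the chain capture longer-range dependencies, but the number of possible contexts grows quickly, so far more text is needed to estimate the counts reliably.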

Christopher Landauer
