Applying Informetric Characteristics of Databases to IR System File Design, Part I: Informetric Models

Abstract This study examines how informetric characteristics of information retrieval (IR) system databases can help the systems designer decide which types of file structures would provide the best performance in a given information system environment. This first of two papers deals with the development of appropriate models describing database contents, to be used later in a simulation study. Database characteristics for which data were collected include the index term frequency distribution, the distribution of terms used per query, and the distribution of term frequency selections. A shifted generalized Waring distribution was found to provide the best fit for the index term distributions with the large data sets used. For the terms used per query, a shifted negative binomial was found to provide a reasonable fit. A complex relationship was observed for the term selection distribution data, so the empirical distribution is used for it. In addition, four other hypothetical term selection relationships are presented. With this information, a simulation study examining system performance under different informetric environments can be undertaken.
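
As an illustration of the terms-per-query model named in the abstract, the following is a minimal sketch of a shifted negative binomial probability mass function. The abstract does not give the parameterization, the size of the shift, or any fitted values, so everything here is an assumption: the shift of 1 (a query has at least one term), the parameter names `r` and `p`, and the function names are all hypothetical.

```python
import math


def shifted_neg_binomial_pmf(x, r, p, shift=1):
    """P(X = x) for X = shift + Y, where Y ~ NegBin(r, p) counts failures.

    Assumed parameterization (not taken from the paper):
    P(Y = k) = Gamma(k + r) / (k! * Gamma(r)) * p**r * (1 - p)**k
    """
    k = x - shift
    if k < 0:
        return 0.0  # no probability mass below the shift
    # Log-gamma form of the coefficient, so r may be non-integer.
    log_coeff = math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
    return math.exp(log_coeff + r * math.log(p) + k * math.log(1 - p))


def shifted_neg_binomial_mean(r, p, shift=1):
    """Mean of the shifted distribution: shift + r * (1 - p) / p."""
    return shift + r * (1 - p) / p
```

With the purely illustrative values `r = 2` and `p = 0.5`, `shifted_neg_binomial_pmf(1, 2, 0.5)` gives the probability of a one-term query and `shifted_neg_binomial_mean(2, 0.5)` the expected number of terms per query under this assumed model.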
