Fundamental limits for information retrieval

The fundamental limits of performance for a general model of information retrieval from databases are studied. In the scenarios considered, a large quantity of information is to be stored on some physical storage device. Requests for information are modeled as a randomly generated sequence with a known distribution. The requests are assumed to be "context-dependent," i.e., to vary according to the sequence of previous requests. The state of the physical storage device is also assumed to depend on the history of previous requests. In general, the logical structure of the information to be stored does not match the physical structure of the storage device, and consequently there are nontrivial limits on the minimum achievable average access times, where the average is over the possible sequences of user requests. The paper applies basic information-theoretic methods to establish these limits and demonstrates constructive procedures that approach them, for a wide class of systems. Allowing redundancy greatly lowers the achievable access times, even when the amount added is an arbitrarily small fraction of the total amount of information in the database. The achievable limits both with and without redundancy are computed; in the case where redundancy is allowed the limits essentially coincide with lower limits for more general storage systems.
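As a rough illustration of why the mismatch between logical and physical structure limits average access time, the following sketch computes the expected cost per request for a small example. Everything in it is an assumption made for illustration, not the paper's model or construction: requests follow a first-order Markov chain (one simple form of context dependence), the storage device is a linear array of cells, the access cost is the distance the head moves between consecutive requests, and the names `stationary`, `average_access_cost`, and `best_layout` are hypothetical.

```python
from itertools import permutations


def stationary(P, iters=1000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by power iteration (assumes the chain is ergodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def average_access_cost(P, layout):
    """Expected head movement per request when item k is stored in cell
    layout[k] and requests follow the Markov chain P (illustrative cost model)."""
    pi = stationary(P)
    n = len(P)
    return sum(
        pi[i] * P[i][j] * abs(layout[i] - layout[j])
        for i in range(n)
        for j in range(n)
    )


def best_layout(P):
    """Brute-force search over all placements; feasible only for tiny examples."""
    n = len(P)
    return min(permutations(range(n)), key=lambda lay: average_access_cost(P, lay))


# Hypothetical 4-item request chain: item 2 tends to follow item 0 (and vice
# versa), and item 3 tends to follow item 1 (and vice versa).
P = [
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
]

naive = average_access_cost(P, (0, 1, 2, 3))   # items stored in index order
optimal = average_access_cost(P, best_layout(P))
print(f"naive layout: {naive:.3f}, best layout: {optimal:.3f}")
```

Even in this toy setting no placement brings the average cost to zero, because a one-dimensional layout cannot put every pair of frequently co-requested items next to each other. Storing extra copies of items would be one way to relax that constraint, which is the role redundancy plays in the abstract's argument.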
