Data Structures for Efficient Broker Implementation

With the profusion of text databases on the Internet, it is becoming increasingly hard to find the most useful databases for a given query. To attack this problem, several existing and proposed systems employ brokers to direct user queries, using a local database of summary information about the available databases. This summary information must effectively distinguish relevant databases, and must be compact while allowing efficient access. We offer evidence that one broker, GlOSS, can be effective at locating databases of interest even in a system of hundreds of databases, and examine the performance of accessing the GlOSS summaries for two promising storage methods: the grid file and partitioned hashing. We show that both methods can be tuned to provide good performance for a particular workload (within a broad range of workloads), and discuss the tradeoffs between the two data structures. As a side effect of our work, we show that grid files are more broadly applicable than previously thought; in particular, we show that by varying the policies used to construct the grid file we can provide good performance for a wide range of workloads even when storing highly skewed data.
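
As a rough illustration of how partitioned hashing can be tuned toward a workload, the following Python sketch stores summary entries keyed on two attributes. It is not the paper's implementation. It assumes, for illustration only, that a summary is a set of (word, database) -> frequency entries, and the names `PartitionedHashSummary`, `word_bits`, and `db_bits` are hypothetical. A bucket address concatenates a few hash bits of the word with a few hash bits of the database name, so a lookup that fixes only one attribute must visit every bucket that shares that attribute's bits.

```python
import zlib
from collections import defaultdict


def _hash_bits(value: str, bits: int) -> int:
    """Hash a string to an integer that fits in `bits` bits (0 when bits == 0)."""
    return zlib.crc32(value.encode("utf-8")) & ((1 << bits) - 1)


class PartitionedHashSummary:
    """Toy two-attribute partitioned hash (assumed layout, not the paper's)."""

    def __init__(self, word_bits: int = 3, db_bits: int = 2):
        self.word_bits = word_bits
        self.db_bits = db_bits
        # bucket address -> {(word, database): frequency}
        self.buckets = defaultdict(dict)

    def _address(self, word_part: int, db_part: int) -> int:
        # Concatenate the word bits (high) with the database bits (low).
        return (word_part << self.db_bits) | db_part

    def put(self, word: str, db: str, freq: int) -> None:
        addr = self._address(_hash_bits(word, self.word_bits),
                             _hash_bits(db, self.db_bits))
        self.buckets[addr][(word, db)] = freq

    def by_word(self, word: str) -> dict:
        """Partial match on word: visits 2**db_bits buckets."""
        w = _hash_bits(word, self.word_bits)
        found = {}
        for d in range(1 << self.db_bits):
            bucket = self.buckets.get(self._address(w, d), {})
            for (entry_word, entry_db), freq in bucket.items():
                if entry_word == word:
                    found[entry_db] = freq
        return found

    def by_database(self, db: str) -> dict:
        """Partial match on database: visits 2**word_bits buckets."""
        d = _hash_bits(db, self.db_bits)
        found = {}
        for w in range(1 << self.word_bits):
            bucket = self.buckets.get(self._address(w, d), {})
            for (entry_word, entry_db), freq in bucket.items():
                if entry_db == db:
                    found[entry_word] = freq
        return found


summary = PartitionedHashSummary(word_bits=3, db_bits=2)
summary.put("grid", "db1", 5)
summary.put("grid", "db2", 7)
summary.put("hash", "db1", 2)
print(summary.by_word("grid"))
print(summary.by_database("db1"))
```

In this sketch, giving more bits to the word makes lookups by database visit more buckets, while giving more bits to the database makes lookups by word visit more buckets, which is one simple way to picture tuning the structure for a particular mix of operations.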
