Effective, efficient retrieval in a network of digital information objects

Although different authors mean different things by the term “digital libraries,” one common thread is that they include or are built around collections of digital objects. Digital libraries also provide services to large communities, and search is almost always one of those services. Digital library collections, however, have several characteristic features that make search difficult. They are typically very large. They typically involve many different kinds of objects, including but not limited to books, e-published documents, images, and hypertexts, and often including items as esoteric as subtitled videos, simulations, and entire scientific databases. Even within a category, these objects may have widely different formats and internal structures. Furthermore, they typically stand in complex relationships with each other and with such non-library objects as persons, institutions, and events.

Relationships are a common feature of traditional libraries, in the form of “See/See also” pointers, hierarchical relationships among categories, and relations between bibliographic and non-bibliographic objects, such as having an author or being on a subject. Binary relations (typically in the form of directed links) are a common representational tool in computer science for structures ranging from trees and graphs to semantic networks. And in recent years the World-Wide Web has made the construct of linked information objects commonplace for millions. Despite this, relationships have rarely been given “first-class” treatment in digital library collections or software.

MARIAN is a digital library system designed and built to store, search over, and retrieve large numbers of diverse objects in a network of relationships. It is designed to run efficiently over large collections of digital library objects. It addresses the problem of object diversity through a system of classes unified by common abilities, including searching and presentation. Divergent internal structure is exposed and interpreted using a simple and powerful graphical representation, and varied format is handled through a unified system of presentation. Most importantly, MARIAN collections are designed specifically to include relations, in the form of an extensible collection of different sorts of links.

This thesis presents MARIAN and argues that it is both effective and efficient. MARIAN is effective in that it provides new and useful functionality to digital library end-users, and in that it makes constructing, modifying, and combining collections easy for library builders and maintainers. MARIAN is efficient in that it works from an abstract presentation of search over networked collections to define, on the one hand, the common operations required to implement a broad class of search engines and, on the other, performance standards for those operations. Although some operations involve a high minimum cost under the most general assumptions, lower costs can be achieved when additional constraints are present. In particular, it is argued that the statistics of digital library collections can be exploited to obtain significant savings. MARIAN is designed to do exactly that, and evidence from early versions suggests that it succeeds. In conclusion, MARIAN presents a powerful and flexible platform for retrieval on large, diverse collections of networked information, significantly extending the representation and search capabilities of digital libraries.
