Customized information extraction as a basis for resource discovery

Indexing file contents is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. We present a model for type-specific, user-customizable information extraction, and a system implementation called Essence. This software structure allows users to associate specialized extraction methods with ordinary files, providing the illusion of an object-oriented file system that encapsulates indexing methods within files. By exploiting the semantics of common file types, Essence generates compact yet representative file summaries that can be used to improve both browsing and indexing in resource discovery systems. Essence can extract information from most of the types of files found in common file systems, including files with nested structure (such as compressed “tar” files). Essence interoperates with a number of different search/index systems (such as WAIS and Glimpse), as part of the Harvest system.
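The idea of associating type-specific extraction methods with ordinary files, including files with nested structure, can be illustrated with a minimal Python sketch. This is not Essence's implementation or interface: the registry `SUMMARIZERS`, the `summarizer` decorator, the name-based `identify_type`, and the three toy extraction rules are all hypothetical names and simplifications chosen for illustration.

```python
import io
import tarfile
from typing import Callable, Dict, List

# Hypothetical registry mapping a file type name to its extraction method.
SUMMARIZERS: Dict[str, Callable[[bytes], List[str]]] = {}


def summarizer(file_type: str):
    """Decorator that registers an extraction method for one file type."""
    def register(func):
        SUMMARIZERS[file_type] = func
        return func
    return register


def identify_type(name: str) -> str:
    """Guess the type from the file name only (an assumption of this sketch)."""
    if name.endswith(".tar"):
        return "tar"
    if name.endswith((".c", ".h")):
        return "c_source"
    return "text"


@summarizer("text")
def summarize_text(data: bytes) -> List[str]:
    # Toy rule: keep only the first non-empty line.
    for line in data.decode("utf-8", errors="replace").splitlines():
        if line.strip():
            return [line.strip()]
    return []


@summarizer("c_source")
def summarize_c(data: bytes) -> List[str]:
    # Toy rule: keep only the #include directives.
    text = data.decode("utf-8", errors="replace")
    return [ln.strip() for ln in text.splitlines()
            if ln.lstrip().startswith("#include")]


@summarizer("tar")
def summarize_tar(data: bytes) -> List[str]:
    # Nested structure: unpack in memory and summarize each member by its own type.
    summary: List[str] = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as archive:
        for member in archive.getmembers():
            if member.isfile():
                inner = archive.extractfile(member).read()
                summary.extend(summarize(member.name, inner))
    return summary


def summarize(name: str, data: bytes) -> List[str]:
    """Dispatch to the extraction method registered for the file's type."""
    return SUMMARIZERS[identify_type(name)](data)
```

In this sketch a user customizes extraction by registering another function under a new type name. The recursive call in `summarize_tar` is what lets an archive's summary be built from the summaries of the files it contains.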
