Adaptive Access for a Digital Library of Corporate Information

pipes technology of UNIX, so that both the input and the resulting output files must be character streams (not indexed, addressable databases). Corpus Linguistics analysis is a necessity, since manual effort will not be able to furnish information on all the new words and phrases and domains of interest that steadily appear. We are putting Corpus Linguistics on a new footing by building it around Object-Oriented database technology (OODB). In our initial studies we are using the technology to do unsupervised (bootstrap) classification of words. The results can aid query expansion, automatic abstracting, disambiguation of terms, identification of specialized knowledge domains, and more. An important extension of the technique that we are also pursuing is to discover higher-order structures in text, such as noun phrases and clauses and their relations. The techniques and results will be described for initial studies on 4M words of the Journal of Bacteriology. The system is built on top of Wood, a persistent heap system for Macintosh Common Lisp with B-tree indexing and a full inverted word index of the corpus.
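
To make the contrast between a character stream and an addressable, indexed corpus concrete, the following is a minimal sketch of a full inverted word index, written in Python purely for illustration. It is not the system described above, which is built in Macintosh Common Lisp on Wood with persistent B-tree indexing; here an in-memory dictionary stands in for that storage. The function names (`tokenize`, `build_inverted_index`, `neighbors`), the toy documents, and the tokenization rule are all assumptions introduced for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(tokens_by_doc):
    """Map each word to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, tokens in tokens_by_doc.items():
        for pos, word in enumerate(tokens):
            index[word].append((doc_id, pos))
    return index


def neighbors(index, tokens_by_doc, word, window=1):
    """Count words within `window` positions of every occurrence of `word`.

    Each occurrence is reached directly through the index, so the corpus
    does not have to be rescanned as a stream.
    """
    found = Counter()
    for doc_id, pos in index.get(word, []):
        tokens = tokens_by_doc[doc_id]
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for i in range(start, end):
            if i != pos:
                found[tokens[i]] += 1
    return found


# Toy corpus (hypothetical text, not taken from the Journal of Bacteriology).
docs = {
    "d1": "The cell wall of the bacterium",
    "d2": "The cell membrane",
}
tokens_by_doc = {doc_id: tokenize(text) for doc_id, text in docs.items()}
index = build_inverted_index(tokens_by_doc)
cell_context = neighbors(index, tokens_by_doc, "cell")
```

Context counts of this kind are one plausible input to an unsupervised word classification, since words that occur in similar contexts can be grouped together; the abstract does not say which features the actual system uses.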