Incompletely and Imprecisely Speaking: Using Dynamic Ontologies for Representing and Retrieving Information

We report on an approach to representation and retrieval of information from large textual databases. Our approach is based on dynamic ontologies that are automatically constructed from textual data by a new method combining techniques from knowledge representation, natural language processing, and machine learning. The method learns concepts automatically from documents, and uses them to build domain-specific ontologies and to organize the information contained in the documents. The ontologies generated are dynamic in that they are constantly updated and expanded as new documents are added, requiring minimal supervision from domain experts. Information contained in the documents is efficiently retrieved based on concepts in the ontology, allowing precision and completeness to be traded off. Experience with a prototype implementation has been very encouraging.

© Copyright 1999. Microelectronics and Computer Technology Corporation. All Rights Reserved.

1 Background

Our access to data continues to grow exponentially, across both intranets and the Internet. The enormous growth in the number of on-line textual information sources gives us a vast amount of information waiting to be discovered, generating intense interest in research on efficient representation and retrieval of textual information. In addition, the database community is becoming increasingly interested in non-conventional types of data; timely access to textual information has become more and more important as well. At MCC, the InfoSleuth™ project [Bayardo et al. 1996, Fowler et al. 1999, Nodine et al. 1999] aims to retrieve and process information in an ever-changing network of information sources. Recent technologies such as internetworking and the World Wide Web have significantly expanded the type, availability and volume of data available to an information management system.
However, most of the current Web technologies rely on keyword-based search engines and are incapable of accessing information based on concepts. InfoSleuth integrates new technological developments such as agent technology, domain ontologies, brokerage, and distributed computing, in support of mediated interoperation of data and services in a dynamic and open environment. Since there is minimal structure in the data on the World Wide Web and this structure usually bears little relationship to the semantics, there can be no static mapping of concepts to structured data sets. Consequently, querying is mostly delegated to search engines that dynamically locate relevant information based on keywords. InfoSleuth views an information source at the level of relevant semantic concepts and aims to deal with possibly incomplete information. Of InfoSleuth technologies, the most relevant for …