Learning user information interests through extraction of semantically significant phrases

InformationFinder is an intelligent agent that learns user information interests from sets of messages or other on-line documents that users have classified. While this problem has been addressed by a number of recent research initiatives, InformationFinder's approach is innovative in two ways. First, the agent uses heuristics to extract significant phrases from documents for learning, rather than using standard mathematical techniques. This enables it to learn highly general search criteria from a small number of sample documents. Second, the agent learns standard decision trees for each user category. These decision trees are easily transformed into search query strings for standard search systems, so no specialized search engine is required.

1. Large-scale on-line information systems

A growing number of businesses and institutions are using distributed information repositories to store large numbers of documents of various types. The growth of Internet services such as the World Wide Web and Gopher, the continued increase in use of Usenet bulletin boards, and the emergence on the market of distributed database platforms such as Lotus Notes™ all enable organizations of any size to collect and organize large heterogeneous collections of documents, ranging from working notes, memos and electronic mail to complete reports, proposals, design documentation, and databases. However, traditional techniques for identifying and gathering relevant documents become unmanageable when the organizations and document collections get very large.

This problem exists outside of corporate information repositories as well. On the Internet's World Wide Web, for instance, it is impossible to even attempt to see all pages that may be of interest. It is equally impossible to simply scan all of the news media (such as newspaper and magazine articles) that are becoming available on the Web.
The same is true of other information systems based on the Internet and other world-wide networks, such as Usenet bulletin boards.

This paper describes an intelligent agent developed to address this problem. It is similar to research systems under development for comparable tasks [Holte and Drummond, 1994; Knoblock and Arens, 1994; Levy et al., 1994; Pazzani et al., 1995] or for other tasks such as e-mail filtering or Usenet message filtering. The agent learns a search query string for each of the user's interest categories, and searches nightly for new documents that match these interests to send to the user. Our most significant finding is that effective results depend largely on extracting high-quality indicator phrases from the documents for input to the learning algorithm, and less on the particular induction algorithms employed.

We present our solution in the context of a Lotus Notes system, consisting of electronic mail, bulletin boards, news services, and databases, but our approach is equally applicable to both the World Wide Web and Usenet. We are planning to make our InformationFinder publicly available for these systems in the near future.

2. Learning user interests

Figure 1 shows a user reading a document about Java, a language for Internet development. Upon reading this document, the user decides that it is representative of his interest in Java. To indicate this to InfoFinder, the user selects the "smiley face" icon in the upper right corner. The agent asks the user to categorize his interest in the document, which he gives as "Java." These categories are fully user-specified and need not be given names representative of the content: they are used simply for grouping of documents (e.g., [Gil, 1994; Lieberman, 1994]) and communication with the user. The document is copied into a collection of sample documents for subsequent processing.

From: AAAI Technical Report SS-96-05. Compilation copyright © 1996, AAAI (www.aaai.org). All rights reserved.
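The claim that a learned decision tree can be turned into a search query string for each interest category can be illustrated with a small sketch. It is not the agent's actual implementation: the tree representation (`Leaf`, `Node`), the function names, the `AND`/`OR`/`NOT` query syntax, and the example phrases "coffee" and "applet" are all assumptions made for illustration. The idea shown is that each root-to-leaf path ending in a "relevant" leaf becomes a conjunction of phrase tests, and the conjunctions are joined by `OR`.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """Terminal node: is a document reaching this leaf relevant to the category?"""
    relevant: bool


@dataclass
class Node:
    """Internal node testing whether a phrase occurs in the document."""
    phrase: str
    present: "Tree"  # subtree followed when the phrase occurs
    absent: "Tree"   # subtree followed when the phrase does not occur


Tree = Union[Leaf, Node]
Path = List[Tuple[str, bool]]


def paths_to_relevant(tree: Tree, prefix: Tuple[Tuple[str, bool], ...] = ()) -> List[Path]:
    """Collect every root-to-leaf path that ends in a relevant leaf."""
    if isinstance(tree, Leaf):
        return [list(prefix)] if tree.relevant else []
    return (
        paths_to_relevant(tree.present, prefix + ((tree.phrase, True),))
        + paths_to_relevant(tree.absent, prefix + ((tree.phrase, False),))
    )


def tree_to_query(tree: Tree) -> str:
    """Render the tree as a boolean query (hypothetical AND/OR/NOT syntax)."""
    clauses = []
    for path in paths_to_relevant(tree):
        terms = [f'"{p}"' if present else f'NOT "{p}"' for p, present in path]
        if terms:  # a bare relevant leaf at the root has no tests to render
            clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(clauses)


# Hypothetical tree for a "Java" category; the phrases are illustrative only.
example = Node(
    phrase="Java",
    present=Node("coffee", present=Leaf(False), absent=Leaf(True)),
    absent=Node("applet", present=Leaf(True), absent=Leaf(False)),
)

query = tree_to_query(example)
print(query)
```

A real search system would need the clauses translated into its own query language. The sketch only shows why a tree over phrase-presence tests maps naturally onto boolean search criteria.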