Learning to Query the Web

The World Wide Web (WWW) is filled with "resource directories", i.e., documents that collect links to all known documents on a specific topic. Keeping resource directories up to date is difficult because the number of online documents grows so rapidly. We propose using machine learning methods to address this problem. In particular, we propose to treat a resource directory as a list of positive examples of an unknown concept, and then to use machine learning methods to construct a definition of that concept from these examples. If the learned definition is in an appropriate form, it can be translated into a query, or a series of queries, for a WWW search engine. The query can then be rerun at a later date to detect new instances of the concept. We present experimental results with two implemented systems and two learning methods. One system is interactive and is implemented as an augmented WWW browser; the other is a batch system that can collect and label documents without any human intervention. The learning methods are the RIPPER rule learning system and a rule-learning version of a new online weight allocation algorithm called the sleeping experts prediction algorithm. The experiments are performed on real data obtained from the WWW.
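As a rough illustration of the translation step, the sketch below assumes the learned definition is a set of rules, each a conjunction of words that must appear in a document, with the concept being the disjunction of the rules. Under that assumption each rule can become one conjunctive query, and the union of the result sets approximates the concept. The rule representation, the `AND` query syntax, the function names, and the example rules are all hypothetical; the abstract does not specify the systems' actual rule format or the target search engine's syntax.

```python
from typing import Iterable, List, Sequence, Set

# Assumed representation: a rule is a conjunction of words that must all
# appear in a document; a rule set is a disjunction of such rules.
Rule = Sequence[str]


def rule_to_query(rule: Rule) -> str:
    """Render one conjunctive rule as a boolean query string (assumed syntax)."""
    return " AND ".join(rule)


def rules_to_queries(rules: Iterable[Rule]) -> List[str]:
    """Emit one query per non-empty rule; results would be unioned."""
    return [rule_to_query(rule) for rule in rules if rule]


def matches(rules: Iterable[Rule], document_words: Set[str]) -> bool:
    """Check locally whether a document satisfies at least one rule."""
    return any(all(word in document_words for word in rule) for rule in rules)


# Hypothetical rule set for a directory about rule learning.
EXAMPLE_RULES: List[Rule] = [
    ["machine", "learning"],
    ["rule", "induction", "ripper"],
]

if __name__ == "__main__":
    for query in rules_to_queries(EXAMPLE_RULES):
        print(query)
```

Issuing each query separately, rather than as one large disjunction, matches the abstract's mention of "a series of queries" and avoids assuming that a search engine supports nested boolean expressions.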