A Machine Learning Approach to Building Domain-Specific Search Engines

Domain-specific search engines are becoming increasingly popular because they offer increased accuracy and extra features not possible with general, Web-wide search engines. Unfortunately, they are also difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, text classification and information extraction that enables efficient spidering, populates topic hierarchies, and identifies informative text segments. Using these techniques, we have built a demonstration system: a search engine for computer science research papers available at www.cora.justresearch.com.
