Machine learning, data mining, and the World Wide Web: design of special-purpose search engines

We present DEADLINER, a special-purpose search engine that indexes conference and workshop announcements and extracts a range of academic information from the Web. Support vector machines (SVMs) provide an efficient and highly accurate mechanism for obtaining relevant web documents. DEADLINER currently extracts speakers, locations (e.g., countries), dates, paper submission (and other) deadlines, topics, program committees, abstracts, and affiliations. Complex and detailed searches are possible on these fields. The niche search engine was constructed by employing a methodology for rapid implementation of specialised search engines. This methodology is based on Bayesian integration of simple extractors, which avoids complex hand-tuned text extraction methods. The simple extractors exploit loose formatting and keyword conventions. The Bayesian framework further produces a search engine in which each user can control each field's false alarm rate in an intuitive and rigorous fashion, thus providing easy-to-use metadata.
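To make the idea of Bayesian integration with a user-controlled false alarm rate concrete, the following is a minimal sketch and not DEADLINER's actual implementation. It assumes binary extractors that fire independently given the true label (a naive Bayes assumption made here for illustration). The extractor names, their hit and false-fire probabilities, and the prior are all hypothetical values chosen for the example.

```python
from itertools import product
from math import exp, log

# Hypothetical extractor statistics (illustrative only, not from the paper):
# name -> (P(fires | target present), P(fires | target absent))
EXTRACTORS = {
    "keyword_deadline": (0.90, 0.20),
    "date_pattern": (0.80, 0.10),
    "list_formatting": (0.60, 0.30),
}


def posterior(fired, prior=0.5, extractors=EXTRACTORS):
    """Posterior probability that a text span is a true target.

    `fired` maps extractor name -> bool. Outputs are combined under the
    assumption that extractors are conditionally independent.
    """
    log_odds = log(prior / (1.0 - prior))
    for name, (p_hit, p_false) in extractors.items():
        if fired[name]:
            log_odds += log(p_hit / p_false)
        else:
            log_odds += log((1.0 - p_hit) / (1.0 - p_false))
    return 1.0 / (1.0 + exp(-log_odds))


def false_alarm_rate(threshold, prior=0.5, extractors=EXTRACTORS):
    """Probability of accepting a span when the target is absent.

    Enumerates every combination of extractor outputs and sums the
    probability (given target absent) of those whose posterior meets
    the acceptance threshold.
    """
    names = list(extractors)
    rate = 0.0
    for outputs in product([False, True], repeat=len(names)):
        fired = dict(zip(names, outputs))
        if posterior(fired, prior, extractors) >= threshold:
            p_absent = 1.0
            for name in names:
                p_false = extractors[name][1]
                p_absent *= p_false if fired[name] else 1.0 - p_false
            rate += p_absent
    return rate
```

Under these assumptions, raising the acceptance threshold on the posterior can only lower the false alarm rate for a field, so a user-facing control could map a requested false alarm rate to the smallest threshold that achieves it.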
