DEADLINER: building a new niche search engine

We present DEADLINER, a search engine that catalogs conference and workshop announcements, and ultimately will monitor and extract a wide range of academic convocation material from the web. The system currently extracts speakers, locations, dates, paper submission (and other) deadlines, topics, program committees, abstracts, and affiliations. A user or user agent can perform detailed searches on these fields. DEADLINER was constructed using a methodology for rapid implementation of specialized search engines. This methodology avoids complex hand-tuned text extraction solutions, or natural language processing, by Bayesian integration of simple extractors that exploit loose formatting and keyword conventions. The Bayesian framework further produces a search engine where each user can control the false alarm rate on a field in an intuitive yet rigorous fashion.
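The abstract does not spell out the Bayesian model, so the following is only a minimal sketch of one way simple extractors could be integrated and a false alarm rate controlled. It assumes the extractors fire independently given the class (a naive Bayes assumption that is not taken from the paper), and the three extractors, their firing probabilities, and all function names are hypothetical illustrations.

```python
import math
from itertools import product

# Hypothetical extractors for a "deadline" field: a nearby keyword,
# a date pattern, and emphasized formatting. All numbers are made up.
P_FIRE_TARGET = [0.9, 0.8, 0.6]  # P(extractor fires | block is a deadline)
P_FIRE_OTHER = [0.1, 0.2, 0.3]   # P(extractor fires | block is not a deadline)


def log_likelihood_ratio(outputs, p_target=P_FIRE_TARGET, p_other=P_FIRE_OTHER):
    """Sum per-extractor log-likelihood ratios (independence assumed)."""
    llr = 0.0
    for fired, p1, p0 in zip(outputs, p_target, p_other):
        if fired:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
    return llr


def false_alarm_rate(threshold, p_target=P_FIRE_TARGET, p_other=P_FIRE_OTHER):
    """Probability that a non-target block scores at or above the threshold.

    Computed exactly by enumerating every pattern of extractor outputs.
    """
    rate = 0.0
    for outputs in product([0, 1], repeat=len(p_other)):
        if log_likelihood_ratio(outputs, p_target, p_other) >= threshold:
            prob = 1.0
            for fired, p0 in zip(outputs, p_other):
                prob *= p0 if fired else (1 - p0)
            rate += prob
    return rate


def pick_threshold(max_false_alarm, p_target=P_FIRE_TARGET, p_other=P_FIRE_OTHER):
    """Lowest achievable score threshold that respects the user's limit."""
    scores = sorted(
        {log_likelihood_ratio(o, p_target, p_other)
         for o in product([0, 1], repeat=len(p_other))}
    )
    for t in scores:  # ascending, so the first match is the lowest threshold
        if false_alarm_rate(t, p_target, p_other) <= max_false_alarm:
            return t
    return math.inf  # no pattern is reliable enough; accept nothing


# A user who tolerates at most 5% false alarms on this field:
threshold = pick_threshold(0.05)
accept = log_likelihood_ratio((1, 1, 0)) >= threshold
```

In this sketch the user states a tolerable false alarm rate and the system turns it into a score threshold, which is one reading of "intuitive yet rigorous" control; a real system would estimate the firing probabilities from labeled pages rather than fix them by hand.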
