Information Retrieval on the World Wide Web and Active Logic: A Survey and Problem Definition

As more information becomes available on the World Wide Web (currently over 4 billion pages covering most areas of human endeavor), providing effective search tools for information access becomes more difficult. Today, people access web information through two main kinds of search interfaces: Browsers (clicking and following hyperlinks) and Query Engines (submitting a set of keywords that describe the topic of interest). The first process is tentative and time-consuming, and the second may not satisfy the user because it returns many inaccurate and irrelevant results. Web search tools need better support for expressing one's information need and for returning high-quality search results. There appears to be a need for systems that reason under uncertainty and are flexible enough to recover from the contradictions, inconsistencies, and irregularities that such reasoning involves. Active Logic is a formalism that has been developed with real-world applications and their challenges in mind. Its design is motivated by the thought that one factor supporting the flexibility of human reasoning is that it takes place step-wise, in time. Active Logic is one of a family of inference engines (step-logics) that explicitly reason in time and incorporate a history of their reasoning as they run. This characteristic makes Active Logic systems more flexible than traditional AI systems and therefore more suitable for commonsense, real-world reasoning. In this report we mainly survey recent advances in machine learning and crawling problems related to the web. We review the continuum from supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges that distinguish information retrieval in the hypertext domain, and summarize the key areas of recent and ongoing research. We concentrate on topic-specific search engines and focused crawling, and finally propose an Information Integration Environment based on the Active Logic framework.
