Relevant Sources of Information Are Not Necessarily Popular Ones

The constant growth of the Web in recent years has made it harder to discover new sources of information on a given topic. This is a prominent problem for Experts in Intelligence Analysis (EIA), who must search for pages on specific and sensitive topics. Because these pages lack popularity, or because they are poorly indexed due to their sensitive content, they are hard to find with traditional search engines. In this article, we describe a new Web source discovery system called DOWSER (Discovery Of Web Sources Evaluating Relevance). The goal of this system is to provide users with new sources of information related to their needs without considering the popularity of a page, unlike classic Information Retrieval tools. The expected result is a balance between relevance and originality, in the sense that the wanted pages are not necessarily popular. DOWSER relies on a user profile to focus its exploration of the Web, so that it collects and indexes only related Web documents. Because queries can be insufficient to express sensitive and specific needs, the user's information needs are specified through the user's interests, represented by DBpedia resources [1] and keywords, both extracted from Web pages provided by the user. A series of experiments provides an empirical evaluation of DOWSER.
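As a rough illustration of the idea of ranking candidate pages by relevance to a user profile rather than by popularity, the following minimal sketch builds a keyword profile from user-provided pages and scores candidates by cosine similarity. It is not DOWSER's implementation: the function names (`build_profile`, `rank_candidates`), the stopword list, the raw term-frequency weighting, and the example URLs are all assumptions made for illustration, and the DBpedia-resource part of the profile is left out entirely.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "for", "is", "are"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_profile(seed_pages):
    """Aggregate term frequencies over the pages supplied by the user."""
    profile = Counter()
    for page in seed_pages:
        profile.update(tokenize(page))
    return profile


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(profile, candidates):
    """Return (score, url) pairs sorted by relevance to the profile.

    Popularity signals such as link counts are deliberately not used.
    """
    scored = [(cosine(profile, Counter(tokenize(text))), url)
              for url, text in candidates.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    profile = build_profile(["smuggling routes across the border",
                             "border smuggling networks"])
    pages = {
        "http://a.example": "report on smuggling networks near the border",
        "http://b.example": "celebrity news and sports scores",
    }
    for score, url in rank_candidates(profile, pages):
        print(f"{score:.3f}  {url}")
```

In a focused crawler, a score of this kind could also be used to decide which outgoing links to follow first, which is one plausible way to keep exploration close to the user's interests.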

[1] Annika Wærn, et al. User Involvement in Automatic Filtering: An Experimental Study, 2004, User Modeling and User-Adapted Interaction.

[2] Wolfgang Nejdl, et al. Using ODP metadata to personalize search, 2005, SIGIR '05.

[3] Patrick Giroux, et al. WebLab: An integration infrastructure to ease the development of multimedia processing applications, 2008.

[4] Gerard Salton, et al. Research and Development in Information Retrieval, 1982, Lecture Notes in Computer Science.

[5] Christian Bizer, et al. DBpedia spotlight: shedding light on the web of documents, 2011, I-Semantics '11.

[6] ChengXiang Zhai, et al. Mining long-term search history to improve search accuracy, 2006, KDD '06.

[8] Ian H. Witten, et al. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links, 2008.

[9] Andrew McCallum, et al. Using Reinforcement Learning to Spider the Web Efficiently, 1999, ICML.

[10] Clement T. Yu, et al. Personalized Web search for improving retrieval effectiveness, 2004, IEEE Transactions on Knowledge and Data Engineering.

[11] Gerard Salton, et al. A vector space model for automatic indexing, 1975, CACM.

[12] Philip Resnik, et al. Using Information Content to Evaluate Semantic Similarity in a Taxonomy, 1995, IJCAI.

[13] Gabriella Pasi, et al. Ontology-Based Information Behaviour to Improve Web Search, 2010, Future Internet.

[14] Ryen W. White, et al. A study of factors affecting the utility of implicit relevance feedback, 2005, SIGIR '05.

[15] Brian D. Davison. Topical locality in the Web, 2000, SIGIR '00.

[16] Carl Lagoze, et al. Focused Crawls, Tunneling, and Digital Libraries, 2002, ECDL.

[17] Fabio Gasparetti, et al. Personalized Search on the World Wide Web, 2007, The Adaptive Web.

[18] Panos Constantopoulos, et al. Research and Advanced Technology for Digital Libraries, 2001, Lecture Notes in Computer Science.

[19] Rajeev Motwani, et al. The PageRank Citation Ranking: Bringing Order to the Web, 1999, WWW 1999.

[20] Martin van den Berg, et al. Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, 1999, Comput. Networks.

[21] Ed H. Chi, et al. Using information scent to model user information needs and actions and the Web, 2001, CHI.

[22] Jens Lehmann, et al. DBpedia - A crystallization point for the Web of Data, 2009, J. Web Semant.

[23] David W. Conrath, et al. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, 1997, ROCLING/IJCLCLP.

[24] Marc Ehrig, et al. Ontology-focused crawling of Web documents, 2003, SAC '03.

[25] Robin Burke, et al. Using Concept Hierarchies to Enhance User Queries in Web-Based Information Retrieval, 2003.

[27] Mitchell P. Marcus, et al. Text Chunking using Transformation-Based Learning, 1995, VLC@ACL.