Enhancing digital libraries using missing content analysis

This work shows how the content of a digital library can be enhanced to better satisfy its users' needs. Missing content is identified by finding missing content topics in the system's query log or in a pre-defined taxonomy of required knowledge. The collection is then enhanced with new, relevant knowledge extracted from external sources that cover those missing content topics. Our experiments measure the precision of the system before and after content enhancement. The results demonstrate a significant improvement in system effectiveness as a result of content enhancement, and show that the missing content enhancement policy outperforms several other possible policies.
