Joint Optimization of Index Freshness and Coverage in Real-Time Search Engines

Real-time search engines increasingly index web content from data streams, since many web sources, including news and social media sites, now deliver up-to-date information via streams. A crucial challenge for such an engine is therefore to improve index freshness, which depends primarily on the latencies incurred during fetching and indexing. Retrieval latency is the time lag between a document's publication and its fetching, while indexing latency is the delay before a fetched document is indexed, which is caused by finite indexing capacity. Retrieval latency can be satisfactorily addressed by appropriate fetch scheduling or by recent real-time content notification protocols. However, as the total volume of real-time content grows rapidly, indexing latency becomes a challenging problem. Furthermore, the need to maximize index coverage makes it more difficult to reduce indexing latency under limited indexing capacity. We consider the problem of jointly optimizing indexing latency and index coverage, in which their relative importance can be adjusted, and propose an optimization model based on inventory control theory. Extensive experiments were conducted to validate the proposed model, and the results suggest that the proposed approach outperforms the alternatives considered.
