Utility-Based Control Feedback in a Digital Library Search Engine: Cases in CiteSeerX

We describe a utility-based feedback control model and its applications within CiteSeerX, an open access digital library search engine and the new version of CiteSeer. CiteSeerX leverages user feedback to correct metadata and reformulate the citation graph. New documents are automatically crawled by a focused crawler for indexing. The URLs of ingested documents are automatically inspected to provide feedback to a whitelist filter, which automatically selects high-quality crawl seed URLs. Changes in a paper's citation count, together with its download history, serve as an indicator of ill-conditioned metadata that needs correction. We believe that these feedback mechanisms effectively improve the overall metadata quality and save computational resources. Although these mechanisms are used in the context of CiteSeerX, we believe they can be readily transferred to other similar systems.
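
As an illustration of the whitelist feedback loop, the following is a minimal sketch and not the CiteSeerX implementation. It assumes the utility of a seed host can be approximated by the fraction of its crawled documents that were ingested. The function name `update_whitelist` and the thresholds `min_crawled` and `min_yield` are hypothetical and chosen only for this example.

```python
from collections import Counter
from urllib.parse import urlparse


def update_whitelist(crawled_urls, ingested_urls, min_crawled=3, min_yield=0.5):
    """Return the hosts whose ingestion yield justifies keeping them as seeds.

    Hypothetical utility: ingested documents / crawled documents, per host.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    crawled = Counter(urlparse(u).netloc for u in crawled_urls)
    ingested = Counter(urlparse(u).netloc for u in ingested_urls)

    whitelist = set()
    for host, n_crawled in crawled.items():
        # Skip hosts with too few samples to estimate a yield.
        if n_crawled < min_crawled:
            continue
        # Counter returns 0 for hosts that never produced an ingested document.
        if ingested[host] / n_crawled >= min_yield:
            whitelist.add(host)
    return whitelist


# Example data (made up): a.edu yields 3 of 4 documents, b.com yields 0 of 3.
crawled = [f"http://a.edu/p{i}.pdf" for i in range(4)] + \
          [f"http://b.com/d{i}.pdf" for i in range(3)]
ingested = [f"http://a.edu/p{i}.pdf" for i in range(3)]
whitelist = update_whitelist(crawled, ingested)
```

In this sketch the ingestion outcome is the feedback signal: hosts that keep producing ingestible documents stay on the whitelist, while hosts with a low yield drop out of the seed list on the next crawl cycle.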
