Scaling SeerSuite in the Cloud

The SeerSuite digital library search engine framework is used to build tools such as CiteSeerx. It includes a complex metadata extraction system that extracts elements such as author names, titles, citations, and citation contexts, which are crucial bibliometric data and are needed to build a citation graph. The workload faced by the extractor is dynamic, and this variability makes CiteSeerx attractive for hosting in a cloud computing environment. However, because of its application binary dependencies and its reliance on specialized infrastructure, the current extractor has several limitations. These limitations motivated the design and implementation of the metadata extraction system proposed in this study. The system uses a message-oriented middleware architecture with a publish/subscribe pattern, which yields a scalable, flexible extractor that can be deployed across a range of cloud infrastructures. To demonstrate the broad applicability of the proposed system, we evaluate its reference implementation across different deployment scenarios and with respect to its scalability.
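
The message-passing idea behind such an architecture can be illustrated with a minimal, in-process sketch. It is not the SeerSuite implementation: the `Broker` class, the topic names `documents` and `metadata`, and the `extract_metadata` stand-in are all hypothetical, and the workers here share one subscription to a topic (competing consumers) rather than each receiving every message. A real deployment would presumably replace the toy broker with a networked message-oriented middleware product.

```python
import queue
import threading


class Broker:
    """Toy in-process broker with one FIFO queue per topic (hypothetical)."""

    def __init__(self):
        self._topics = {}
        self._lock = threading.Lock()

    def _queue(self, topic):
        # Create the topic queue on first use; the lock avoids a creation race.
        with self._lock:
            return self._topics.setdefault(topic, queue.Queue())

    def publish(self, topic, message):
        self._queue(topic).put(message)

    def consume(self, topic, timeout=0.1):
        """Return the next message on a topic, or None if none arrives in time."""
        try:
            return self._queue(topic).get(timeout=timeout)
        except queue.Empty:
            return None


def extract_metadata(document):
    """Stand-in extractor (assumption): treats the first line as the title."""
    lines = document["text"].splitlines()
    return {"id": document["id"], "title": lines[0].strip() if lines else ""}


def worker(broker, stop):
    """Take documents from 'documents' and publish results to 'metadata'."""
    while not stop.is_set():
        doc = broker.consume("documents")
        if doc is not None:
            broker.publish("metadata", extract_metadata(doc))


def run_pipeline(documents, n_workers=2):
    """Publish documents, let the workers process them, and collect the results."""
    broker = Broker()
    stop = threading.Event()
    workers = [
        threading.Thread(target=worker, args=(broker, stop))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for doc in documents:
        broker.publish("documents", doc)
    results = []
    while len(results) < len(documents):
        record = broker.consume("metadata", timeout=1.0)
        if record is not None:
            results.append(record)
    stop.set()
    for w in workers:
        w.join()
    return sorted(results, key=lambda r: r["id"])
```

Because the publisher and the workers only share topic names, the number of workers can be changed through `n_workers` without touching the publishing code, which is the property that makes this style of decoupling attractive for a variable extraction workload.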
