Computational Issues in Digital Library Search Engines

The open source SeerSuite digital library search engine framework has been used to build several information systems, including CiteSeerX, ChemXSeer, and recommender systems. The framework, its components, and data collected from its instances have been used by researchers around the world in work on information systems, machine learning, software design, and systems. We describe the architecture and workflow of SeerSuite. Several features of SeerSuite instances, such as the dynamic workloads imposed by web traffic and document acquisition systems and their maintenance requirements, make cloud computing an attractive option. We discuss in detail the economics and impact of hosting instances of SeerSuite in the cloud. The framework depends on a complex metadata extraction system that extracts crucial entities such as authors, titles, and citations with their contexts; these are needed to build citation graphs and to link and rank documents in the collection. However, the current metadata extractor suffers from several limitations that result from dependencies in the program, code complexity, lack of parallelization, and a need for specialized infrastructure in the form of large-scale shared storage. These limitations prevent the SeerSuite framework from scaling to support large collections. In this dissertation, we discuss the design and implementation of a metadata extraction system. Specifically, we present details of a scalable and portable system that is built on a message-oriented middleware architecture with a publish/subscribe approach and that can be deployed across different physical and cloud infrastructures. Experimental results indicate that the throughput of the extraction system can be increased by several factors. We also discuss lessons learned from building and deploying the metadata extraction system, especially those related to scaling, reliability, and cloud deployments.
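As a rough illustration of the publish/subscribe idea mentioned above, the following sketch runs two independent extractors that subscribe to a topic of crawled documents and publish their results to a second topic. It is a toy, in-process model under stated assumptions, not the SeerSuite implementation: the `Broker` class, the topic names `document.crawled` and `metadata.extracted`, and the two extractor functions are all hypothetical, and a real deployment would use a message broker running across machines.

```python
import queue
import threading
from collections import defaultdict

STOP = None  # sentinel message telling a subscriber to shut down


class Broker:
    """Hypothetical in-process topic broker; each subscriber gets its own queue."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._lock = threading.Lock()

    def subscribe(self, topic):
        inbox = queue.Queue()
        with self._lock:
            self._subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        # Copy the subscriber list under the lock, then deliver outside it.
        with self._lock:
            targets = list(self._subscribers[topic])
        for inbox in targets:
            inbox.put(message)


def extract_title(text):
    # Assumption for illustration: the first line of the text is the title.
    lines = text.splitlines()
    return lines[0].strip() if lines else ""


def extract_citation_count(text):
    # Assumption for illustration: reference entries start with "[".
    return sum(1 for line in text.splitlines() if line.startswith("["))


def run_extractor(kind, extract, inbox, broker):
    """Consume documents from the inbox and publish one result per document."""
    while True:
        doc = inbox.get()
        if doc is STOP:
            break
        result = {"doc_id": doc["id"], "kind": kind, "value": extract(doc["text"])}
        broker.publish("metadata.extracted", result)


def run_pipeline(docs):
    broker = Broker()
    results = broker.subscribe("metadata.extracted")
    workers = []
    for kind, fn in [("title", extract_title), ("citations", extract_citation_count)]:
        inbox = broker.subscribe("document.crawled")
        worker = threading.Thread(target=run_extractor, args=(kind, fn, inbox, broker))
        worker.start()
        workers.append(worker)
    for doc in docs:
        broker.publish("document.crawled", doc)
    broker.publish("document.crawled", STOP)
    for worker in workers:
        worker.join()
    # All workers have finished, so the results queue is complete.
    collected = []
    while not results.empty():
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    sample = [{"id": "d1", "text": "A Sample Title\n[1] First ref\n[2] Second ref"}]
    for record in run_pipeline(sample):
        print(record)
```

The publisher never refers to an extractor directly, so in this sketch a new extractor can be added by subscribing another worker to `document.crawled` without changing the code that publishes documents. The order of the collected records depends on thread scheduling and is not guaranteed.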
