Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine

As the document collection and user population grow, the capability and performance of a digital library such as CiteSeerX may be limited by several bottlenecks. This paper describes the current infrastructure of the CiteSeerX academic digital library search engine, outlines its current bottlenecks, and proposes feasible solutions. These bottlenecks exist in various components of the system, including hardware, web crawling, text extraction, and storage. The hardware bottleneck is the increasing difficulty of maintaining a cluster of almost twenty physical servers. The solution is to merge some servers and run the whole system on a virtual architecture. The web crawling bottleneck is that the seed URLs are biased toward computer science, information sciences, technology, and related fields. One approach to balancing the domain distribution of our crawl repository is to obtain seed URLs from generic search engines. Another bottleneck is the average time needed to extract text from the crawled documents. To reduce the processing time, we have proposed a new extraction model using message queues and multiple threads. Preliminary experiments indicate that the average time to extract a document can be reduced by an order of magnitude. The storage bottleneck is that, as the data repository grows, a better tool is required to manage the storage, transfer, sharing, and backup of our files. Hadoop provides a promising tool for parallelizing data analysis, and the Hadoop Distributed File System provides a solution for shared storage. All solutions to the current bottlenecks are either under testing or on our roadmap. Our indexing protocol has no foreseeable bottlenecks in the near future.
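The extraction model mentioned in the abstract, a message queue feeding multiple threads, can be illustrated with a minimal sketch. This is not the CiteSeerX implementation: it uses Python's in-process `queue.Queue` as a stand-in for a real message queue, and `extract_text`, `run_extraction`, and `num_workers` are hypothetical names introduced only for illustration.

```python
import queue
import threading


def extract_text(payload: bytes) -> str:
    # Hypothetical placeholder for a real extractor (e.g. a PDF-to-text tool).
    return payload.decode("utf-8", errors="replace").strip()


def run_extraction(documents: dict, num_workers: int = 4) -> dict:
    """Extract text from every document using a shared queue and worker threads."""
    tasks = queue.Queue()      # stand-in for a message queue (assumption)
    results = {}
    lock = threading.Lock()    # guards writes to the shared results dict

    def worker() -> None:
        while True:
            item = tasks.get()
            try:
                if item is None:   # sentinel: no more work for this worker
                    return
                doc_id, payload = item
                text = extract_text(payload)
                with lock:
                    results[doc_id] = text
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Producer side: enqueue one message per crawled document.
    for doc_id, payload in documents.items():
        tasks.put((doc_id, payload))
    # One sentinel per worker so each thread shuts down cleanly.
    for _ in threads:
        tasks.put(None)

    tasks.join()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    crawled = {"doc-1": b"  first document  ", "doc-2": b"second document"}
    print(run_extraction(crawled, num_workers=2))
```

In CPython, threads of this kind help most when each extraction waits on I/O or on an external process; the speedup reported in the abstract comes from the authors' own experiments and is not reproduced by this sketch.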
