CiteSeerX: AI in a Digital Library Search Engine

CiteSeerX is a digital library search engine that provides access to more than five million scholarly documents and serves nearly a million users and millions of hits per day. We present key AI technologies used in the following components: document classification and de-duplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5–6 years. For each, we describe its usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. We also present the AI technologies implemented in table search and algorithm search, which are special search modes in CiteSeerX. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and search engines.
