CiteSeerX: 20 years of service to scholarly big data

We give an overview of CiteSeerX, the pioneering digital library search engine that has served academic communities for more than 20 years (it was first released in 1998), from three perspectives. The system perspective summarizes how its architecture evolved in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. The third perspective concerns AI: to keep the system scalable and effective, AI technologies are employed in all essential modules. To train the underlying models effectively, a sufficient amount of data has been labeled, and this labeled data can be reused for training future models. Finally, we discuss the future of CiteSeerX. Our ongoing work aims to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated at 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will then have more opportunities to explore and trace datasets that can be reused and to discover other datasets for new research projects. We close by summarizing the lessons learned about making a similar system more sustainable and useful.
