CiteSeerX data: semanticizing scholarly papers

Scholarly big data is, for many, an important instance of Big Data. Digital library search engines have been built to acquire, extract, and ingest large volumes of scholarly papers. This paper provides an overview of the scholarly big data released by CiteSeerX as of the end of 2015 and discusses how the data is acquired, its size, general quality, data management, and accessibility. Preliminary results on extracting semantic entities from the body text of scholarly papers with Wikifier show a bias towards general terms that appear in Wikipedia and against domain-specific terms. We argue that domain-specific terms will play a more important role in extracting important facts from scholarly papers.
