On the Use of ArXiv as a Dataset

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline that standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7-million-edge citation graph, together with an 11-billion-word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.
