Paper Information - On the Use of ArXiv as a Dataset - 字舞流文

On the Use of ArXiv as a Dataset

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline that standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7-million-edge citation graph, with an 11-billion-word corpus of full-text research articles. We present some baseline classification results and motivate the application of more exciting generative graph models.

Alexander A. Alemi | Matthew Bierbaum | Colin B. Clement | Kevin P. O'Keeffe

[1] Quoc V. Le, et al. Document Embedding with Paragraph Vectors, 2015, ArXiv.

[2] Herbert Van de Sompel, et al. The open archives initiative: building a low-barrier interoperability framework, 2001, JCDL '01.

[3] David Liben-Nowell, et al. The link-prediction problem for social networks, 2007.

[4] Siew Ann Cheong, et al. Using Machine Learning to Predict the Evolution of Physics Research, 2018, ArXiv.

[5] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[6] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.

[7] Michal Lopuszynski, et al. Tagging Scientific Publications Using Wikipedia and Natural Language Processing Tools - Comparison on the ArXiv Dataset, 2013, TPDL Workshops.

[8] Evgeniy Gabrilovich, et al. A Review of Relational Machine Learning for Knowledge Graphs, 2015, Proceedings of the IEEE.

[9] Palash Goyal, et al. Graph Embedding Techniques, Applications, and Performance: A Survey, 2017, Knowl. Based Syst.

[10] Jure Leskovec, et al. Inductive Representation Learning on Large Graphs, 2017, NIPS.

[11] Dalibor Fiala, et al. Network-based statistical comparison of citation topology of bibliographic databases, 2014, Scientific Reports.

[12] Walter Dempsey, et al. Hierarchical network models for structured exchangeable interaction processes, 2019, 1901.09982.

[13] Jure Leskovec, et al. Representation Learning on Graphs: Methods and Applications, 2017, IEEE Data Eng. Bull.

[14] Jon M. Kleinberg, et al. Overview of the 2003 KDD Cup, 2003, SKDD.

[15] Razvan Pascanu, et al. Relational inductive biases, deep learning, and graph networks, 2018, ArXiv.

[16] Iadh Ounis, et al. NTCIR-11 Math-2 Task Overview, 2014, NTCIR.

[17] Nan Hua, et al. Universal Sentence Encoder, 2018, ArXiv.

[18] Alexander A. Alemi, et al. Text Segmentation based on Semantic Word Embeddings, 2015, ArXiv.

[19] Iryna Gurevych, et al. Predicting Research Trends From Arxiv, 2019, ArXiv.