hep-th

We apply techniques from natural language processing, computational linguistics, and machine learning to investigate papers in hep-th and four related sections of the arXiv: hep-ph, hep-lat, gr-qc, and math-ph. The titles of all papers in each of these sections, from the inception of the arXiv until the end of 2017, are extracted and treated as a corpus on which we train the neural network Word2Vec. A comparative study of common n-grams, linear syntactic identities, word clouds, and word similarities is carried out. We find notable scientific and sociological differences between the fields. In conjunction with support vector machines, we also show that the syntactic structure of titles in the different sub-fields of high energy and mathematical physics is sufficiently distinctive that a neural network can perform a binary classification of formal versus phenomenological sections with 87.1% accuracy, and a finer five-fold classification across all sections with 65.1% accuracy.
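As a minimal sketch of the pipeline the abstract describes (train Word2Vec on tokenized titles, represent each title by the mean of its word vectors, and classify sections with a support vector machine), the following Python uses gensim and scikit-learn. The toy corpus, the hyperparameters, and the helper `title_vector` are illustrative assumptions, not the authors' actual code or settings.

```python
# Hedged sketch of the title-classification pipeline; assumes
# gensim >= 4.0 and scikit-learn. The corpus below is placeholder
# data standing in for the full arXiv title corpus through 2017.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import LinearSVC

# Placeholder (title, section) pairs; the study uses all titles from
# hep-th, hep-ph, hep-lat, gr-qc, and math-ph.
corpus = [
    ("on bms invariance of gravitational scattering", "hep-th"),
    ("machine learning calabi-yau volumes", "hep-th"),
    ("higgs boson production at the lhc", "hep-ph"),
    ("lattice qcd at finite temperature", "hep-lat"),
]

tokenized = [title.split() for title, _ in corpus]

# Train a skip-gram Word2Vec model on the titles (sg=1 selects the
# skip-gram architecture; these hyperparameters are illustrative).
w2v = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, sg=1)

def title_vector(tokens, model):
    """Represent a title by the mean of its word vectors."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.array([title_vector(t, w2v) for t in tokenized])
y = np.array([section for _, section in corpus])

# SVM classification of arXiv section from the averaged title vector;
# on real data one would hold out a test set to measure accuracy.
clf = LinearSVC()
clf.fit(X, y)
print(clf.predict(X[:1]))
```

On a realistically sized corpus, the same trained model also supports the word-similarity and linear-identity queries mentioned in the abstract, e.g. `w2v.wv.most_similar("holography")` or analogy-style queries via the `positive`/`negative` arguments of `most_similar`.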
