Large Scale Subject Category Classification of Scholarly Papers With Deep Attentive Neural Networks

Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, such as computer science or physics. Subject category information can be used to build faceted search for digital library search engines, which significantly helps users narrow down the space of relevant documents. Unfortunately, many academic papers do not have such information as part of their metadata. Existing methods for this task usually focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing a given paper may not be readily available; in particular, new papers with few or no citations cannot be classified by such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained on 9 million abstracts from Web of Science (WoS), using the WoS schema of 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines that vary the architecture and text representation. Our best model achieves a micro-F1 of 0.76, with F1 scores for individual subject categories ranging from 0.50 to 0.95. The results show the importance of retraining word embedding models to maximize vocabulary overlap, as well as the effectiveness of the attention mechanism. The combination of word vectors with TF-IDF outperforms character- and sentence-level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible mitigation strategies. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
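
To make the described architecture concrete, the following is a minimal PyTorch sketch of a network of the kind the abstract describes: two stacked bi-directional recurrent layers followed by an additive attention layer and a classifier over the 104 WoS categories. The choice of GRU cells, the hidden size, embedding dimension, and vocabulary size are illustrative assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class DANN(nn.Module):
    """Sketch of a deep attentive network: BiRNN stack + attention pooling."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_classes=104):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Two bi-directional recurrent layers (GRU chosen here for illustration).
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=2,
                          bidirectional=True, batch_first=True)
        # Additive attention: score each time step, pool by weighted sum.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded abstract tokens
        h, _ = self.rnn(self.embedding(token_ids))    # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (batch, seq_len, 1)
        context = (weights * h).sum(dim=1)            # (batch, 2*hidden)
        return self.classifier(context)               # (batch, num_classes)

model = DANN(vocab_size=50000)
logits = model(torch.randint(1, 50000, (8, 200)))  # a batch of 8 abstracts
print(logits.shape)  # torch.Size([8, 104])
```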
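
The TF-IDF-weighted word-vector representation mentioned above can likewise be sketched: each abstract is encoded as the TF-IDF-weighted average of its word vectors. The toy corpus and random 300-dimensional vectors below are placeholders for the word2vec embeddings retrained on WoS abstracts; this is an assumed reading of the representation, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "we classify scholarly papers into subject categories using abstracts",
    "citation networks cluster papers by linked references",
]

# TF-IDF weight for each (document, term) pair
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(abstracts)      # sparse (n_docs, vocab)
vocab = tfidf.get_feature_names_out()

# Stand-in word vectors; the paper retrains embeddings on WoS abstracts
# to maximize vocabulary overlap with the classification corpus.
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(len(vocab), 300))

# Document vector = TF-IDF-weighted average of its words' vectors.
totals = np.asarray(weights.sum(axis=1))      # (n_docs, 1) total weight per doc
doc_vecs = (weights.toarray() @ word_vecs) / np.maximum(totals, 1e-9)
print(doc_vecs.shape)  # (2, 300)
```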
