Unsupervised Neural Categorization for Scientific Publications

Most conventional document categorization methods require a large number of documents with labeled categories for training. These methods are hard to apply in scenarios, such as scientific publications, where training data is expensive to obtain and categories can change over the years and across domains. In this work, we propose UNEC, an unsupervised representation learning model that directly categorizes documents without the need for labeled training data. Specifically, we develop a novel cascade embedding approach. We first embed concepts, i.e., significant phrases mined from scientific publications, into continuous vectors that capture concept semantics. Based on the concept similarity graph built from these concept embeddings, we further embed the concepts into a hidden category space, where their category information becomes explicit. Finally, we categorize documents by jointly considering the category attribution of their concepts. Our experimental results show that UNEC significantly outperforms several strong baselines on a number of real scientific corpora, under both automatic and manual evaluation.
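The cascade described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes phrase mining has already reduced each document to a list of concepts, and it substitutes word2vec for the concept embedding step, scikit-learn's SpectralEmbedding (Laplacian eigenmaps over a kNN concept-similarity graph) for the hidden category space, and k-means plus majority voting for the final category attribution. All names (docs, n_categories, etc.) are hypothetical.

```python
# Hypothetical sketch of the cascade-embedding pipeline; components are
# stand-ins for the paper's actual model, not its implementation.
from collections import Counter

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.manifold import SpectralEmbedding

# Toy corpus: each document is already reduced to its mined concepts
# (quality phrases); in the paper these come from a phrase-mining step.
docs = [
    ["neural_network", "backpropagation", "deep_learning"],
    ["topic_model", "latent_dirichlet_allocation", "gibbs_sampling"],
    ["deep_learning", "word_embedding", "neural_network"],
    ["gibbs_sampling", "topic_model", "word_embedding"],
]
n_categories = 2  # assumed number of hidden categories

# Step 1: embed concepts into continuous vectors (word2vec as a stand-in
# for the paper's concept embedding).
w2v = Word2Vec(sentences=docs, vector_size=32, window=5, min_count=1, seed=0)
concepts = list(w2v.wv.index_to_key)
vecs = np.stack([w2v.wv[c] for c in concepts])

# Step 2: embed concepts into a low-dimensional hidden category space via
# the concept similarity graph (SpectralEmbedding builds a kNN affinity
# graph and applies Laplacian eigenmaps to it internally).
cat_space = SpectralEmbedding(
    n_components=n_categories, affinity="nearest_neighbors", n_neighbors=3
).fit_transform(vecs)

# Step 3: make the category attribution of each concept explicit by
# clustering in the hidden category space.
km = KMeans(n_clusters=n_categories, n_init=10, random_state=0)
concept_category = dict(zip(concepts, km.fit_predict(cat_space)))

# Step 4: categorize each document by jointly considering (here: majority
# vote over) the categories of its concepts.
for doc in docs:
    votes = Counter(concept_category[c] for c in doc)
    print(doc, "->", votes.most_common(1)[0][0])
```

On real corpora the clustering step in the category space would replace this toy k-means/voting scheme, but the flow (concept vectors, then a similarity graph, then a second graph-based embedding, then per-document aggregation) mirrors the cascade the abstract describes.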
