Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text

For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full text or the abstract. It is therefore desirable to have text mining and text classification algorithms that already perform well on the title alone. So far, classification performance on titles has not been competitive with performance on full texts when the same number of training samples is used. However, title data is much easier to obtain in large quantities than full-text data. In this paper, we investigate how models trained on increasing amounts of title data compare to models trained on a constant number of full texts. We evaluate this question on two large-scale datasets, one from the medical domain (PubMed) and one from economics (EconBiz). In these datasets, the titles and annotations of millions of publications are available, and they outnumber the available full texts by factors of 20 and 15, respectively. To exploit these large amounts of data to their full potential, we develop three strong deep learning classifiers and evaluate their performance on the two datasets. The results are promising. On the EconBiz dataset, all three title-based classifiers outperform their full-text counterparts by a large margin, and the best title-based classifier outperforms the best full-text method by 9.4%. On the PubMed dataset, the best title-based method almost reaches the performance of the best full-text classifier, with a difference of only 2.9%.
