Paper Information - Recurrent Neural Network Language Model Adaptation Derived Document Vector

In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document and can be important in some NLP tasks such as genre classification. This paper proposes a novel distributed vector representation of a document: a simple recurrent-neural-network language model (RNN-LM) or a long short-term memory RNN language model (LSTM-LM) is first trained on all documents in a task; some of the LM parameters are then adapted to each document, and the adapted parameters are vectorized to represent the document. The new document vectors are labeled DV-RNN and DV-LSTM, respectively. We believe that the new document vectors can capture some high-level sequential information in the documents that other current document representations fail to capture. The new document vectors were evaluated on genre classification of documents in three corpora: the Brown Corpus, the BNC Baby Corpus, and an artificially created Penn Treebank dataset. Their classification performance was compared with that of the TF-IDF vector and the state-of-the-art distributed memory model of paragraph vector (PV-DM). The results show that DV-LSTM significantly outperforms TF-IDF and PV-DM in most cases, and that combining the proposed document vectors with TF-IDF or PV-DM may further improve performance.
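The shortcoming of TF-IDF noted in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `tfidf_vectors`, the toy sentences, and the smoothed IDF formula are all assumptions chosen for illustration. The sketch shows that two documents containing the same words in a different order receive identical TF-IDF vectors, which is the information an order-sensitive representation such as DV-RNN or DV-LSTM is meant to retain.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors): one TF-IDF vector per tokenized document.

    Hypothetical helper for illustration only; the IDF variant below is an
    assumption, not the formula used in the paper.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed inverse document frequency (one common variant).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency is the relative count of the word in the document.
        vectors.append([counts[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors


docs = [
    "the dog bit the man".split(),
    "the man bit the dog".split(),       # same words, different order
    "a quiet walk in the park".split(),  # different words
]
vocab, vecs = tfidf_vectors(docs)

# Word order is invisible to a bag-of-words model.
same_bag = vecs[0] == vecs[1]
differs_from_third = vecs[0] != vecs[2]
print(same_bag, differs_from_third)
```

The first two toy documents describe opposite events yet are indistinguishable under this representation, because only word counts enter the vector.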
