Derivation of Document Vectors from Adaptation of LSTM Language Model

In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. A major shortcoming of this frequency-based feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document. This paper proposes a novel distributed vector representation of a document, labeled DV-LSTM, which is derived from adapting a long short-term memory (LSTM) recurrent neural network language model to the document. DV-LSTM is expected to capture high-level sequential information in the document that other current document representations fail to capture. It was evaluated on document genre classification using the Brown Corpus and the BNC Baby Corpus. The results show that DV-LSTM significantly outperforms the TF-IDF vector and the paragraph vector (PV-DM) in most cases, and that combining them may further improve classification performance.
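The abstract names the idea but not its mechanics. One plausible reading, sketched below purely for illustration, is that a document vector can be taken as the change in language-model parameters after a few adaptation (fine-tuning) steps on that document. The sketch substitutes a toy softmax bigram language model for the LSTM, and every name and hyperparameter here (`BigramLM`, `document_vector`, the learning rate, the step count) is an assumption, not the paper's actual method.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class BigramLM:
    """Toy softmax bigram LM standing in for the LSTM LM (illustrative only)."""
    def __init__(self, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (vocab_size, vocab_size))

    def sgd_step(self, tokens, lr=0.1):
        # One SGD step on next-token cross-entropy over the token sequence.
        grad = np.zeros_like(self.W)
        for prev, nxt in zip(tokens[:-1], tokens[1:]):
            p = softmax(self.W[prev])
            p[nxt] -= 1.0          # gradient of cross-entropy w.r.t. logits
            grad[prev] += p
        self.W -= lr * grad / max(len(tokens) - 1, 1)

def document_vector(pretrained_W, doc_tokens, steps=5, lr=0.1):
    # Adapt a copy of the pretrained LM to one document; the flattened
    # parameter change serves as the document vector.
    lm = BigramLM(pretrained_W.shape[0])
    lm.W = pretrained_W.copy()
    for _ in range(steps):
        lm.sgd_step(doc_tokens, lr=lr)
    return (lm.W - pretrained_W).ravel()

# Pretrain on a small "corpus", then derive vectors for two documents.
V = 6
base = BigramLM(V, seed=42)
corpus = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
for _ in range(20):
    base.sgd_step(corpus)

dv_a = document_vector(base.W, [0, 1, 2, 0, 1, 2])
dv_b = document_vector(base.W, [5, 4, 3, 5, 4, 3])
```

Documents with different word sequences pull the adapted parameters in different directions, so their vectors differ; a classifier (e.g. for genre) would then be trained on these vectors.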