Derivation of Document Vectors from Adaptation of LSTM Language Model

In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. A major shortcoming of this frequency-based feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document. This paper proposes a novel distributed vector representation of a document, labeled DV-LSTM, which is derived from adapting a long short-term memory (LSTM) recurrent neural network language model to the document. DV-LSTM is expected to capture high-level sequential information in the document that other current document representations fail to capture. It was evaluated on document genre classification using the Brown Corpus and the BNC Baby Corpus. The results show that DV-LSTM significantly outperforms the TF-IDF vector and the paragraph vector (PV-DM) in most cases, and that combining them may further improve classification performance.
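The abstract names the idea but not its mechanics. One plausible reading, sketched below purely for illustration, is that a document vector can be taken as the change in language-model parameters after a few adaptation (fine-tuning) steps on that document. The sketch substitutes a toy softmax bigram language model for the LSTM, and every name and hyperparameter here (`BigramLM`, `document_vector`, the learning rate, the step count) is an assumption, not the paper's actual method.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class BigramLM:
    """Toy softmax bigram LM standing in for the LSTM LM (illustrative only)."""
    def __init__(self, vocab_size, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (vocab_size, vocab_size))

    def sgd_step(self, tokens, lr=0.1):
        # One SGD step on next-token cross-entropy over the token sequence.
        grad = np.zeros_like(self.W)
        for prev, nxt in zip(tokens[:-1], tokens[1:]):
            p = softmax(self.W[prev])
            p[nxt] -= 1.0          # gradient of cross-entropy w.r.t. logits
            grad[prev] += p
        self.W -= lr * grad / max(len(tokens) - 1, 1)

def document_vector(pretrained_W, doc_tokens, steps=5, lr=0.1):
    # Adapt a copy of the pretrained LM to one document; the flattened
    # parameter change serves as the document vector.
    lm = BigramLM(pretrained_W.shape[0])
    lm.W = pretrained_W.copy()
    for _ in range(steps):
        lm.sgd_step(doc_tokens, lr=lr)
    return (lm.W - pretrained_W).ravel()

# Pretrain on a small "corpus", then derive vectors for two documents.
V = 6
base = BigramLM(V, seed=42)
corpus = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
for _ in range(20):
    base.sgd_step(corpus)

dv_a = document_vector(base.W, [0, 1, 2, 0, 1, 2])
dv_b = document_vector(base.W, [5, 4, 3, 5, 4, 3])
```

Documents with different word sequences pull the adapted parameters in different directions, so their vectors differ; a classifier (e.g. for genre) would then be trained on these vectors.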