PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks

Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector, have been attracting increasing attention due to their simplicity, scalability, and effectiveness. However, compared to sophisticated deep learning architectures such as convolutional neural networks, these methods usually yield inferior results when applied to particular machine learning tasks. One possible reason is that these text embedding methods learn the representation of text in a fully unsupervised way, without leveraging the labeled information available for the task. Although the low-dimensional representations learned in this way are applicable to many different tasks, they are not particularly tuned for any specific task. In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call predictive text embedding (PTE). Predictive text embedding utilizes both labeled and unlabeled data to learn the embedding of text. The labeled information and different levels of word co-occurrence information are first represented as a large-scale heterogeneous text network, which is then embedded into a low-dimensional space through a principled and efficient algorithm. This low-dimensional embedding not only preserves the semantic closeness of words and documents but also has strong predictive power for the particular task. Compared to recent supervised approaches based on convolutional neural networks, predictive text embedding is comparable or more effective, is much more efficient, and has fewer parameters to tune.
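To make the described construction concrete, the sketch below illustrates one way the heterogeneous text network could be assembled and embedded: it builds word-word, word-document, and word-label bipartite networks from a toy corpus and trains shared word vectors with LINE-style weighted edge sampling and negative sampling, then embeds a document as the average of its word vectors. This is a minimal illustration under our own assumptions, not the authors' implementation; every function name, variable name, and hyperparameter is illustrative.

```python
# Minimal sketch of the PTE idea (illustrative only, not the authors' code).
from collections import defaultdict
import numpy as np

# Toy corpus: (tokens, label); label is None for unlabeled documents.
corpus = [
    (["good", "movie", "great", "acting"], "pos"),
    (["terrible", "plot", "bad", "acting"], "neg"),
    (["great", "plot", "good", "pacing"], None),
]

# Build the heterogeneous text network as three weighted edge dictionaries.
ww, wd, wl = defaultdict(float), defaultdict(float), defaultdict(float)
for doc_id, (tokens, label) in enumerate(corpus):
    for i, w in enumerate(tokens):
        wd[(w, doc_id)] += 1.0                        # word-document edge
        if label is not None:
            wl[(w, label)] += 1.0                     # word-label edge
        for j in range(max(0, i - 2), min(len(tokens), i + 3)):
            if j != i:
                ww[(w, tokens[j])] += 1.0             # word-word co-occurrence

dim, rng = 16, np.random.default_rng(0)
words = sorted({w for w, _ in wd})
emb = {w: rng.normal(0.0, 0.1, dim) for w in words}   # shared word vectors
ctx = {("w", w): rng.normal(0.0, 0.1, dim) for w in words}
ctx.update({("d", d): rng.normal(0.0, 0.1, dim) for d in range(len(corpus))})
ctx.update({("l", l): rng.normal(0.0, 0.1, dim) for l in {l for _, l in wl}})

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(edges, kind, lr=0.05, neg=3, steps=2000):
    """Sample weighted edges from one bipartite network; SGD with negative sampling."""
    keys = list(edges)
    probs = np.array([edges[k] for k in keys]) / sum(edges.values())
    candidates = [k[1] for k in ctx if k[0] == kind]  # context vertices of this type
    for _ in range(steps):
        w, c = keys[rng.choice(len(keys), p=probs)]   # edge sampled by weight
        pairs = [(c, 1.0)] + [(candidates[rng.integers(len(candidates))], 0.0)
                              for _ in range(neg)]
        grad = np.zeros(dim)
        for c_k, y in pairs:
            v = ctx[(kind, c_k)]
            g = sigmoid(emb[w] @ v) - y
            grad += g * v
            ctx[(kind, c_k)] = v - lr * g * emb[w]
        emb[w] -= lr * grad

for _ in range(3):                                    # alternate over the three networks
    train(ww, "w"); train(wd, "d"); train(wl, "l")

# Embed any document as the average of its word vectors.
doc_vec = np.mean([emb[t] for t in corpus[2][0]], axis=0)
print(doc_vec[:4])
```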
