Hybrid Deep Neural Network-Based Text Representation Model to Improve Microblog Retrieval

Abstract: Retrieving relevant information from Twitter is challenging because of vocabulary mismatch, sheer volume, and noise. Representing the content of tweets is a critical part of any microblog retrieval model. Deep neural networks can learn good representations of text data, which in turn lead to better matching. In this paper, we aim to improve both representation and retrieval effectiveness in microblogs. To that end, we propose a Hybrid Deep Neural Network (HDNN)-based text representation model that extracts effective feature representations for clustering-oriented microblog retrieval. HDNN combines recurrent and feedforward neural network architectures. Specifically, using a bi-directional LSTM, we first generate a deep contextualized word representation that incorporates character n-grams from FastText. However, these contextual embeddings lie in a high-dimensional space and are not all important. Some of them are redundant, correlated, and sometimes noisy, which makes the learned models over-fitted, complex, and less interpretable. To deal with these problems, we propose a Hybrid-Regularized-Autoencoder-based method that combines an autoencoder with Elastic Net regularization for effective unsupervised feature selection and extraction. Our experimental results show that the performance of clustering, and especially of information retrieval in microblogs, depends heavily on the feature representation.
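As a rough illustration of what combining an autoencoder with Elastic Net regularization can mean, the sketch below computes a reconstruction loss plus an L1 and L2 penalty on the weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the layer sizes, the activation, or the penalty weights, so the single tanh hidden layer, the linear decoder, and the names `encode`, `decode`, `elastic_net_penalty`, `regularized_loss`, `l1` and `l2` are all hypothetical.

```python
import math

# Hypothetical single-hidden-layer autoencoder; shapes and activation are assumptions.

def encode(x, W_enc, b_enc):
    """Map an input vector to a lower-dimensional code (tanh is an assumed choice)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(h, W_dec, b_dec):
    """Linearly reconstruct the input vector from the code."""
    return [sum(w * hj for w, hj in zip(row, h)) + b
            for row, b in zip(W_dec, b_dec)]

def elastic_net_penalty(weights, l1, l2):
    """Elastic Net penalty: l1 * sum|w| (sparsity) + l2 * sum w^2 (shrinkage)."""
    flat = [w for row in weights for w in row]
    return l1 * sum(abs(w) for w in flat) + l2 * sum(w * w for w in flat)

def regularized_loss(batch, W_enc, b_enc, W_dec, b_dec, l1=0.01, l2=0.01):
    """Mean squared reconstruction error over the batch plus Elastic Net penalties."""
    total = 0.0
    for x in batch:
        x_hat = decode(encode(x, W_enc, b_enc), W_dec, b_dec)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    reconstruction = total / len(batch)
    return (reconstruction
            + elastic_net_penalty(W_enc, l1, l2)
            + elastic_net_penalty(W_dec, l1, l2))
```

In a sketch like this, the L1 term pushes individual weights toward zero, which is what would allow uninformative input dimensions to be dropped, while the L2 term keeps correlated dimensions from producing unstable weights. Minimizing the loss with a gradient-based optimizer is left out here.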
