Towards a real-time processing framework based on improved distributed recurrent neural network variants with fastText for social big data analytics

Abstract Big data generated by social media is a valuable source of information and offers an excellent opportunity to mine insights. In particular, user-generated content such as reviews, recommendations, and user behavior data supports the marketing activities of many companies. Knowing what users say in social media reviews about the products they bought or the services they used is a key factor in decision-making. Sentiment analysis is one of the fundamental tasks in Natural Language Processing. Although deep learning for sentiment analysis has achieved great success and has allowed many firms to extract relevant information from their textual data, a model that runs in a traditional environment cannot remain effective as the volume of data grows, which underlines the importance of efficient distributed deep learning models for social big data analytics. Moreover, social media analysis is a complex process that involves a set of complex tasks. It is therefore important to address the challenges of social big data analytics and to improve the classification accuracy of deep learning techniques in order to support better decisions. In this paper, we propose an approach for sentiment analysis that combines fastText with Recurrent Neural Network (RNN) variants to represent textual data efficiently and then uses the new representations to perform the classification task. Its main objective is to enhance the classification accuracy of well-known RNN variants and to handle large-scale data. In addition, we propose a distributed intelligent system for real-time social big data analytics, designed to ingest, store, process, index, and visualize huge amounts of information in real time.
The proposed system adopts distributed machine learning together with our proposed method to enhance decision-making processes. Extensive experiments conducted on two benchmark data sets demonstrate that our proposal for sentiment analysis outperforms well-known distributed recurrent neural network variants, namely Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), and Gated Recurrent Unit (GRU). Specifically, we tested our approach with each of these three deep learning models, and the results show that it improves the performance of all three. The current work can benefit researchers and practitioners who want to collect, handle, analyze, and visualize several sources of information in real time. It can also contribute to a better understanding of public opinion and user behavior through our proposed system, which uses the improved variants of powerful distributed deep learning and machine learning algorithms. Furthermore, it can increase the classification accuracy of existing RNN-based approaches to sentiment analysis.
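
To make the representation step concrete, the following is a minimal sketch of the subword idea that fastText is known for: a word is broken into character n-grams with boundary markers, each n-gram is hashed into a fixed-size table of vectors, and the word vector is the average of the looked-up vectors. It is an illustration only, not the method evaluated in this work: the table is filled with random values instead of trained embeddings, and the names `char_ngrams`, `fnv1a`, `make_table`, `word_vector`, and `sentence_matrix` are hypothetical.

```python
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def fnv1a(text):
    """32-bit FNV-1a hash, used here only as a deterministic bucket hash."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def make_table(buckets=1000, dim=8, seed=0):
    """Assumption: random vectors stand in for trained subword embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(buckets)]


def word_vector(word, table, n_min=3, n_max=6):
    """Average the vectors of the whole token and of its character n-grams."""
    units = [f"<{word}>"] + char_ngrams(word, n_min, n_max)
    ids = [fnv1a(u) % len(table) for u in units]
    dim = len(table[0])
    return [sum(table[i][d] for i in ids) / len(ids) for d in range(dim)]


def sentence_matrix(text, table):
    """One vector per whitespace token: the sequence an RNN layer would read."""
    return [word_vector(w, table) for w in text.lower().split()]


if __name__ == "__main__":
    print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
    table = make_table()
    print(len(sentence_matrix("great service", table)))
```

In a full pipeline, the list returned by `sentence_matrix` would be passed, one vector per time step, to an LSTM, BiLSTM, or GRU layer followed by a classifier. Because vectors are built from subword units, misspelled or unseen words that are common in social media text still receive a representation.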
