Stream-based active learning for sentiment analysis in the financial domain

The relationship between public sentiment and stock prices has been the focus of several studies. This paper analyzes whether the sentiment expressed in Twitter feeds that discuss selected companies and their products can indicate changes in those companies' stock prices. To address this problem, an active learning approach was developed and applied to sentiment analysis of tweet streams in the stock market domain. The paper first presents a static Twitter data analysis problem, explored to determine the best Twitter-specific text preprocessing setting for training the Support Vector Machine (SVM) sentiment classifier. In the static setting, the Granger causality test shows that sentiment in stock-related tweets can be used as an indicator of stock price movements a few days in advance; results improved when the SVM classifier was adapted to categorize Twitter posts into three sentiment categories (positive, negative and neutral) instead of positive and negative only. These findings were adopted in the development of a new stream-based active learning approach to sentiment analysis, applicable to incremental learning from continuously changing financial tweet streams. To this end, a series of experiments was conducted to determine the best querying strategy for active learning of the SVM classifier adapted to sentiment analysis of financial tweet streams. Experiments analyzing stock market sentiment for a particular company show that changes in positive sentiment probability can be used as indicators of changes in stock closing prices.
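
As a rough illustration of what stream-based active learning with a linear SVM can look like, the following minimal sketch processes tweets one at a time and asks for a label only when the current model is uncertain, or occasionally at random. It is not the method evaluated in the paper: the querying rule (margin threshold plus random sampling), the hinge-loss subgradient update, the binary labels (the paper uses three classes), and all names and parameter values (`stream_active_learning`, `oracle`, `threshold`, `random_rate`) are assumptions chosen for brevity.

```python
import random


def margin(weights, features):
    """Dot product of a sparse feature dict with the weight dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def hinge_update(weights, features, label, learning_rate=0.1, reg=0.01):
    """One subgradient step on the L2-regularised hinge loss (label is +1 or -1)."""
    violated = label * margin(weights, features) < 1.0
    # Regularisation shrinks every weight slightly on each update.
    for name in list(weights):
        weights[name] *= 1.0 - learning_rate * reg
    # Move towards the example only if it is inside the margin or misclassified.
    if violated:
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + learning_rate * label * value


def stream_active_learning(stream, oracle, threshold=0.5, random_rate=0.1, seed=0):
    """Process examples one by one; query the oracle only for selected examples.

    An example is queried when its absolute margin is below `threshold`
    (the model is uncertain) or, otherwise, with probability `random_rate`.
    Returns the learned weights and the number of labels requested.
    """
    rng = random.Random(seed)
    weights = {}
    queried = 0
    for features in stream:
        uncertain = abs(margin(weights, features)) < threshold
        if uncertain or rng.random() < random_rate:
            hinge_update(weights, features, oracle(features))
            queried += 1
    return weights, queried


if __name__ == "__main__":
    # Hypothetical toy stream: bag-of-words features for two kinds of tweets.
    positive = {"surge": 1.0, "beat": 1.0}
    negative = {"plunge": 1.0, "miss": 1.0}
    toy_stream = [positive, negative] * 50

    # Hypothetical oracle standing in for a human annotator.
    def toy_oracle(features):
        return 1 if "surge" in features else -1

    learned, n_queries = stream_active_learning(toy_stream, toy_oracle)
    print("labels requested:", n_queries, "of", len(toy_stream))
    print("margin for positive tweet:", margin(learned, positive))
    print("margin for negative tweet:", margin(learned, negative))
```

The random component is one simple way to keep sampling examples the model is already confident about, which matters when the stream changes over time; other querying strategies could be substituted in the same loop.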
