Topic Detection and Document Similarity on Financial News

Traders often rely on financial news to predict stock price changes. The vast amount of news data makes it essential to use an automated method to identify the news items relevant to a given criterion. In this study we use Latent Dirichlet Allocation (LDA) to model the correlation of news items with stock price time series data. An LDA model is trained on news items from a past time window, and the trained model is then used to measure the similarity between current news items and the news items used for training. The calculated similarity measure can be used as a predictor of future switching points. We tested our methodology on a collection of about 1,700,000 financial news items published between 2015-01-01 and 2015-12-31 and compared the results with various standard classification techniques. Our results indicate that using LDA instead of standard classification techniques makes it possible to achieve the same level of performance with a much smaller feature space.
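The train-then-compare step described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, the toy headlines, and the choice of cosine similarity against the mean topic mixture of the training window are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def train_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    n_dk = [[0] * n_topics for _ in docs]            # document-topic counts
    n_kw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # tokens per topic
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][wid[w]] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], wid[w]
                # Remove the token, resample its topic, then add it back.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    # Smoothed topic-word (phi) and document-topic (theta) distributions.
    phi = [[(n_kw[k][v] + beta) / (n_k[k] + n_words * beta) for v in range(n_words)]
           for k in range(n_topics)]
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(n_dk, docs)]
    return wid, phi, theta


def infer_topics(doc, wid, phi, n_iter=50, alpha=0.1, seed=0):
    """Estimate the topic mixture of an unseen document with phi held fixed."""
    rng = random.Random(seed)
    n_topics = len(phi)
    ids = [wid[w] for w in doc if w in wid]  # out-of-vocabulary words are skipped
    z = [rng.randrange(n_topics) for _ in ids]
    counts = [0] * n_topics
    for k in z:
        counts[k] += 1
    for _ in range(n_iter):
        for i, v in enumerate(ids):
            counts[z[i]] -= 1
            weights = [(counts[t] + alpha) * phi[t][v] for t in range(n_topics)]
            z[i] = rng.choices(range(n_topics), weights)[0]
            counts[z[i]] += 1
    total = len(ids) + n_topics * alpha
    return [(c + alpha) / total for c in counts]


def cosine(p, q):
    """Cosine similarity between two topic mixtures."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))


# Hypothetical, already tokenized headlines standing in for a past time window.
window_docs = [
    "bank raises rates inflation fears".split(),
    "central bank rates decision inflation".split(),
    "oil prices fall supply glut".split(),
    "crude oil supply rises prices drop".split(),
]
wid, phi, theta = train_lda(window_docs, n_topics=2)

# Profile of the training window: mean topic mixture over its documents (an assumption).
profile = [sum(col) / len(theta) for col in zip(*theta)]

current_item = "bank signals rates rise on inflation".split()
similarity = cosine(infer_topics(current_item, wid, phi), profile)
print(f"similarity to training window: {similarity:.3f}")
```

Each news item is reduced to a vector with one entry per topic rather than one entry per vocabulary term, which is the kind of smaller feature space the abstract refers to. On a corpus of realistic size, a library implementation such as scikit-learn's `LatentDirichletAllocation` would be a more practical choice than this hand-written sampler.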
