Integrating Multiple Data Sources for Stock Prediction

In many real-world applications, decisions are made by collecting and weighing information from multiple data sources. Take the stock market as an example. Investors rarely decide on the basis of a single piece of advice; they rely on a collection of information, such as stock price movements, trading volumes, and market indices, as well as news articles, expert comments, and special announcements (e.g., an increase in stamp duty). Yet modeling the stock market is difficult because (1) the process governing market states (up and down) is stochastic and hard to capture with a deterministic approach; and (2) the market state is invisible but is influenced by visible market information, such as stock prices and news articles. In this paper, we model the stock market process using a Non-homogeneous Hidden Markov Model (NHMM) that takes multiple sources of information into account when making a prediction. Our model contains three major elements: (1) the external event, which denotes an event happening within the stock market (e.g., a drop in the US interest rate); (2) the observed market state, which denotes the current market status (e.g., a rise in the stock price); and (3) the hidden market state, which conceptually exists but is invisible to market participants. Specifically, we model external events using the information contained in news articles, and model the observed market state using historical stock prices. Based on these two pieces of observable information and the previous hidden market state, we aim to identify the current hidden market state and thereby predict the immediate market movement. Extensive experiments were conducted to evaluate our work. The encouraging results indicate that the proposed approach is practically sound and effective.
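
To make the interplay of the three elements concrete, the following is a minimal sketch of one filtering step in a non-homogeneous hidden Markov model, where the transition probabilities between hidden market states depend on the external event at each time step. It is an illustration under stated assumptions, not the model from the paper: the state names, the event categories (`good_news`, `bad_news`, `no_news`), the observation symbols (`rise`, `fall`), all probability values, and the function names `filter_step`, `run_filter`, and `predict_next` are hypothetical.

```python
# Hypothetical illustration only: states, events, observations, and all
# probabilities below are made-up values, not taken from the paper.
STATES = ("up", "down")

# P(hidden state at t | hidden state at t-1, external event at t).
# One transition table per event type is what makes the chain non-homogeneous.
TRANSITIONS = {
    "good_news": {"up": {"up": 0.9, "down": 0.1}, "down": {"up": 0.6, "down": 0.4}},
    "bad_news": {"up": {"up": 0.4, "down": 0.6}, "down": {"up": 0.1, "down": 0.9}},
    "no_news": {"up": {"up": 0.7, "down": 0.3}, "down": {"up": 0.3, "down": 0.7}},
}

# P(observed market state | hidden state).
EMISSIONS = {
    "up": {"rise": 0.8, "fall": 0.2},
    "down": {"rise": 0.3, "fall": 0.7},
}


def filter_step(belief, event, observation):
    """Update the belief over hidden states given one event and one observation."""
    unnormalized = {}
    for state in STATES:
        # Propagate the previous belief through the event-specific transitions.
        predicted = sum(belief[prev] * TRANSITIONS[event][prev][state] for prev in STATES)
        # Weight by how well this hidden state explains the observed price move.
        unnormalized[state] = predicted * EMISSIONS[state][observation]
    total = sum(unnormalized.values())
    return {state: value / total for state, value in unnormalized.items()}


def run_filter(events, observations, prior=None):
    """Apply filter_step over aligned sequences, starting from a uniform prior."""
    belief = dict(prior) if prior else {s: 1.0 / len(STATES) for s in STATES}
    for event, observation in zip(events, observations):
        belief = filter_step(belief, event, observation)
    return belief


def predict_next(belief, next_event):
    """One-step-ahead distribution over hidden states, given the next event."""
    return {
        state: sum(belief[prev] * TRANSITIONS[next_event][prev][state] for prev in STATES)
        for state in STATES
    }


if __name__ == "__main__":
    belief = run_filter(["good_news", "no_news"], ["rise", "rise"])
    print(belief)
    print(predict_next(belief, "bad_news"))
```

In this sketch the event selects the transition table, the price observation reweights the result, and normalization yields the current belief over hidden states. A real system would have to estimate these probabilities from data and derive events from news text, neither of which is shown here.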
