News-based forecasting and modeling

This thesis focuses on forecasting and modeling problems based on quantitative news data. The news data in my experiments are produced by Lydia, our news analysis system, which analyzes spatial, temporal, and linguistic statistics of named-entity occurrences in text corpora across different news sources. Specifically, the problems I studied fall into two categories: (1) how news data can help people analyze and predict societal variables, especially financial variables such as movie grosses and stock prices; and (2) what process generates news data, and how the distribution of future news can be predicted. On the one hand, traditional financial analysis emphasizes how price data incorporate other relevant financial indicators. Since the 1990s, linguistic sources such as news have repeatedly been shown to carry meaningful information beyond traditional quantitative financial data, and thus they can be used as predictive indicators in finance. In this thesis, we conduct a comprehensive study of large-scale news data modeling and of how such data support financial analysis in a broad sense, by analyzing two important financial markets. First, we show how news data help build models to analyze and predict financial markets with coarse time granularity, such as the movie market. Next, we show how financial markets with finer time granularity, such as stock markets, can be factored and analyzed with news as well. Our analysis provides concrete evidence that news data are highly informative and have significant predictive power in financial analysis, a claim that has been made in the literature but has never been demonstrated in practice by large-scale analysis. On the other hand, the thesis also studies the statistical patterns of news, builds models to generate news time series, and attempts to forecast future news fluctuations.
Our statistical analysis shows that log-normal and power-law distributions generally describe many aspects of news behavior. Based on the principles we discovered, we propose two models to describe news: a Log-Normal (LN) model and a novel Layered Hidden Markov Model (LHMM). Our studies show that the LHMM is, overall, the more favorable model for simulating news data and forecasting future news pulses. Most importantly, we study and forecast the future of news entities in a group context. Based on our analysis, we can answer some interesting news forecasting questions. For example, what is the probability that an entity becomes the most famous one in a group? And what is the likelihood that a trivial entity becomes highly important within a given future time period? Our analysis shows that these questions can be answered by fitting power-law tails, and we validated the model on several news groups in different domains. Our study provides useful insights for the analysis of issues in finance, political science, and social science.
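As a rough illustration of what fitting a power-law tail involves, the following minimal sketch estimates a tail exponent with the Hill estimator and uses it to approximate the probability of exceeding a large value. It is not the model developed in this thesis. The function names, the choice of the Hill estimator, the synthetic Pareto sample, and the cutoff `k` are all assumptions made for the example.

```python
import math
import random


def hill_tail_exponent(values, k):
    """Hill estimate of the tail exponent alpha from the k largest values.

    Returns (alpha_hat, threshold), where threshold is the (k+1)-th
    largest value and marks the start of the fitted tail.
    """
    if not 0 < k < len(values):
        raise ValueError("k must satisfy 0 < k < len(values)")
    ordered = sorted(values, reverse=True)
    threshold = ordered[k]
    if threshold <= 0:
        raise ValueError("the tail threshold must be positive")
    log_excess = sum(math.log(v / threshold) for v in ordered[:k])
    if log_excess == 0:
        raise ValueError("the k largest values are all equal to the threshold")
    return k / log_excess, threshold


def tail_probability(x, alpha, threshold, k, n):
    """Approximate P(X > x) for x >= threshold under the fitted tail."""
    if x < threshold:
        raise ValueError("x must lie in the fitted tail")
    return (k / n) * (x / threshold) ** (-alpha)


# Synthetic stand-in for entity reference counts (hypothetical data):
# a Pareto sample whose true tail exponent is 2.0.
rng = random.Random(0)
sample = [rng.paretovariate(2.0) for _ in range(20000)]

k = 2000  # number of upper order statistics treated as the tail (assumed)
alpha_hat, threshold = hill_tail_exponent(sample, k)
p_large = tail_probability(10 * threshold, alpha_hat, threshold, k, len(sample))
print(f"alpha_hat={alpha_hat:.3f}, P(X > 10*threshold)={p_large:.5f}")
```

The estimate is sensitive to the cutoff `k`. On real news counts one would typically inspect the estimate across a range of cutoffs before relying on any extrapolated probability.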