A preprocessing method of internet search data for prediction improvement: application to Chinese stock market

Correlations between Internet search data and socio-economic indicators have been demonstrated in many studies, but data preprocessing, the groundwork of these studies and a determinant of the quality of their results, has lacked a systematic methodology. In this paper, we develop a comprehensive method for preprocessing Internet search data, comprising three critical steps: (a) keyword selection, (b) time difference measurement, and (c) leading index composition. Applying our method to Chinese stock market prices, we obtain a leading keyword index with a stable leading relation and a high degree of fit. Specifically, the correlation coefficient between our leading keyword index and the Shanghai Composite Index reaches 98.7%, and a Granger test confirms that the keyword index has significant predictive ability for the Shanghai Composite Index. Adding the keyword index to the AR model reduces the MAPE from 3.8% to 1.4%, and each percentage-point change in the keyword index is associated with a 0.136 percentage-point move in the same direction in the Shanghai Composite Index in the next period.
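The abstract does not say how steps (b) and (c) are computed. As a rough illustration only, the sketch below assumes that the time difference is measured by lagged Pearson correlation and that the leading index is an unweighted average of lag-aligned, standardized keyword series. The function names `best_lead` and `compose_leading_index`, the `max_lag` parameter, and the equal weighting are all hypothetical choices, not the authors' procedure.

```python
import numpy as np

def best_lead(keyword, target, max_lag=8):
    """Return (lag, corr): the lead, in periods, at which `keyword`
    correlates most strongly with `target` (assumed criterion)."""
    keyword = np.asarray(keyword, dtype=float)
    target = np.asarray(target, dtype=float)
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        # Pair keyword[t] with target[t + lag].
        x = keyword[: len(keyword) - lag]
        y = target[lag:]
        r = np.corrcoef(x, y)[0, 1]
        if r > best_corr:
            best_lag, best_corr = lag, r
    return best_lag, best_corr

def compose_leading_index(keywords, target, max_lag=8):
    """Shift each keyword series by its own lead, z-score it, and
    average the results with equal weights (assumed weighting)."""
    n = len(target)
    aligned = []
    for series in keywords:
        series = np.asarray(series, dtype=float)
        lag, _ = best_lead(series, target, max_lag)
        # The value observed at t - lag is placed at t;
        # the first `lag` slots have no observation and stay NaN.
        shifted = np.full(n, np.nan)
        shifted[lag:] = series[: n - lag]
        aligned.append((shifted - np.nanmean(shifted)) / np.nanstd(shifted))
    # Periods where any keyword is missing remain NaN.
    return np.mean(aligned, axis=0)

if __name__ == "__main__":
    # Synthetic data: the keyword series leads the target by 3 periods.
    rng = np.random.default_rng(0)
    base = np.cumsum(rng.normal(size=203))
    target = base[:200]
    keyword = base[3:203] + rng.normal(scale=0.05, size=200)
    print(best_lead(keyword, target))
```

A composite built this way could then be added as a regressor to an AR model and compared by MAPE, as the abstract describes, though the weighting and lag-selection rules would need to follow the paper itself.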
