Measuring Housing Vitality from Multi-Source Big Data and Machine Learning

Abstract: Timely, high-resolution measures of socioeconomic outcomes are critical for policymaking and evaluation but are hard to obtain reliably. With the help of machine learning and cheaply available data such as social media and nightlight, it is now possible to predict such indices at fine granularity. This article demonstrates an adaptive way to measure the time trend and spatial distribution of housing vitality (the number of occupied houses) using multiple easily accessible datasets: energy, nightlight, and land-use data. We first identified high-frequency housing occupancy status from energy consumption data and then matched it with monthly nightlight data. We then introduced the Factor-Augmented Regularized Model for prediction (FarmPredict) to handle dependence and collinearity among predictors by effectively lifting the prediction space, an approach that is compatible with most machine learning algorithms. The heterogeneity issue in big data analysis is mitigated through the land-use data. FarmPredict allows us to extend the regional results to the city level, explaining 76% of the out-of-sample spatial and temporal variation in house usage. Since energy is indispensable for life, our method is highly transferable, requiring only publicly accessible data. Our article provides an alternative, statistical machine learning approach to predicting socioeconomic outcomes without relying on existing census and survey data. Supplementary materials for this article are available online.
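The factor-augmented idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear factor model for the predictors, estimates the factors by principal components, and applies a plain lasso (coordinate descent) to the idiosyncratic components. The function names (`farm_predict_fit`, `farm_predict`, `lasso_cd`), the penalty level `lam`, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    # Lasso proximal step: shrink z toward zero by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(U, r, lam, n_iter=200):
    # Minimise (1/2n)||r - U b||^2 + lam * ||b||_1 by coordinate descent.
    n, p = U.shape
    b = np.zeros(p)
    col_sq = (U ** 2).sum(axis=0) / n
    resid = r.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = U[:, j] @ resid / n + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += U[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return b

def farm_predict_fit(X, y, n_factors, lam):
    # Centre the data, then split X into common factors F and
    # idiosyncratic components U using principal components.
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_factors].T          # p x K principal directions
    F = Xc @ V                    # n x K estimated factors
    U = Xc - F @ V.T              # n x p idiosyncratic part, orthogonal to F
    # Unpenalised fit on the factors, then lasso on what remains.
    gamma = np.linalg.lstsq(F, yc, rcond=None)[0]
    beta = lasso_cd(U, yc - F @ gamma, lam)
    return {"x_mean": x_mean, "y_mean": y_mean, "V": V,
            "gamma": gamma, "beta": beta}

def farm_predict(model, X_new):
    # Apply the same decomposition to new data and combine both parts.
    Xc = X_new - model["x_mean"]
    F = Xc @ model["V"]
    U = Xc - F @ model["V"].T
    return model["y_mean"] + F @ model["gamma"] + U @ model["beta"]

# Simulated example (assumed data, not from the article): 30 strongly
# correlated predictors driven by 2 latent factors.
rng = np.random.default_rng(0)
n, p, k = 200, 30, 2
f = rng.normal(size=(n, k))
B = rng.normal(size=(p, k))
u = 0.5 * rng.normal(size=(n, p))
X = f @ B.T + u
y = f @ np.array([1.0, -1.0]) + 2.0 * u[:, 0] + 0.1 * rng.normal(size=n)

model = farm_predict_fit(X[:150], y[:150], n_factors=k, lam=0.05)
pred = farm_predict(model, X[150:])
ss_res = ((y[150:] - pred) ** 2).sum()
ss_tot = ((y[150:] - y[150:].mean()) ** 2).sum()
print("out-of-sample R^2:", 1.0 - ss_res / ss_tot)
```

Because the principal-component factors are orthogonal to the idiosyncratic components in-sample, fitting the factors first and then penalising only the remainder is equivalent here to a joint fit with an unpenalised factor block. In practice the number of factors and the penalty level would need to be chosen from the data, which this sketch does not attempt.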
