Big enterprise registration data imputation: Supporting spatiotemporal analysis of industries in China

Big, fine-grained enterprise registration data that includes time and location information enables us to quantitatively analyze, visualize, and understand the patterns of industries at multiple scales across time and space. However, data quality issues such as incompleteness and ambiguity hinder such analysis and application. These issues become more challenging when the volume of data is immense and constantly growing. High Performance Computing (HPC) frameworks can tackle the computational issues of big data, but few studies have systematically investigated imputation methods for enterprise registration data in this type of computing environment. In this paper, we propose a big data imputation workflow for enterprise registration data, based on Apache Spark and a bare-metal computing cluster. We integrated external data sources, employed Natural Language Processing (NLP), and compared several machine-learning methods to address the incompleteness and ambiguity problems found in enterprise registration data. Experimental results illustrate the feasibility, efficiency, and scalability of the proposed HPC-based imputation framework, which also provides a reference for processing other big georeferenced text data. Using these imputation results, we visualize and briefly discuss the spatiotemporal distribution of industries in China, demonstrating the potential applications of such data once quality issues are resolved.
