Big Data? More Challenges!

Recent advances in data acquisition technologies have led to massive amounts of data being collected routinely in the physical, chemical, and engineering sciences, as well as in the information sciences and technology. In addition to their volume, the data often have complicated structure. Examples of such big data include data streams obtained from complex engineering systems, image sequences, climate data, website transaction logs, credit card records, and so forth. Because of their large volume and complicated structure, big data are difficult to handle using traditional database management and statistical analysis tools, and they pose many new challenges for statisticians in describing and analyzing them properly.

To address these challenges and promote new statistical methods for handling big data, Technometrics decided in late 2013 to publish a special issue on the topic, and a guest editorial board was established soon after that decision. The board includes Drs. Ming-Hui Chen, Radu V. Craiu, Robert B. Gramacy, Willis A. Jensen, Faming Liang, Chuanhai Liu, and William Q. Meeker as associate editors, and me as editor. The Call for Papers was published in the journal and in other media in early 2014, and we received 23 high-quality submissions before the deadline. All submissions went through the journal's regular review procedure. In addition to the guest editorial board, some associate editors on the journal's regular editorial board also helped handle submissions. In the end, 11 articles were selected for publication in the special issue; they cover a wide range of topics in describing, analyzing, and computing with big data. These articles are briefly discussed below.

The first five articles propose numerical algorithms that can analyze big data quickly. In "Orthogonalizing EM: A Design-Based Least Squares Algorithm," Shifeng Xiong, Bin Dai, Jared Huling, and Peter Z. G. Qian propose an efficient iterative algorithm for various least squares problems, motivated by a design-of-experiments perspective. The algorithm, called orthogonalizing EM (OEM), works for ordinary least squares and can be extended easily to penalized least squares. The main idea of the procedure is to orthogonalize the design matrix by adding new rows and then to solve the original problem by embedding the augmented design in a missing-data framework.

In "Speeding Up Neighborhood Search in Local Gaussian Process Prediction," Robert B. Gramacy and Benjamin Haaland suggest an algorithm that accelerates the neighborhood search used in local Gaussian process prediction, which is common in nonlinear and nonparametric prediction problems, particularly when Gaussian processes are deployed as emulators for computer experiments.

The third article, "A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data," by Faming Liang, Jinsu Kim, and Qifan Song, proposes a bootstrap Metropolis-Hastings (BMH) algorithm that provides a general framework for taming powerful MCMC methods in big data analysis. The main idea of the algorithm is to replace the full-data log-likelihood by a Monte Carlo average of log-likelihoods calculated in parallel from multiple bootstrap samples, as sketched below.

The fourth article, "Compressing an Ensemble With Statistical Models: An Algorithm for Global 3D Spatio-Temporal Temperature," by Stefano Castruccio and Marc G. Genton, suggests an algorithm for compressing 3D spatio-temporal temperature data using a statistics-based approach that explicitly accounts for the space-time dependence of the data.
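To fix ideas, the following is a minimal sketch of the bootstrap Metropolis-Hastings idea described above: log-likelihoods evaluated on bootstrap subsamples are rescaled to the full-data size, averaged, and plugged into the usual Metropolis-Hastings acceptance step. The function names (log_lik, log_prior, proposal_rng), the subsample size m, the rescaling factor N/m, and the symmetric random-walk proposal are illustrative assumptions for this sketch, not details taken from the article.

```python
import numpy as np

def bmh_sampler(data, log_lik, log_prior, proposal_rng, theta0,
                n_iter=1000, n_boot=20, m=500, rng=None):
    """Minimal sketch of a bootstrap Metropolis-Hastings (BMH) sampler.

    The full-data log-likelihood is replaced by a Monte Carlo average of
    log-likelihoods evaluated on `n_boot` bootstrap subsamples of size `m`,
    each rescaled to the full-data size.  `data` is assumed to be a NumPy
    array indexable by row; `log_lik(theta, subset)` and `log_prior(theta)`
    are user-supplied placeholders.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(data)

    def approx_log_lik(theta):
        vals = []
        for _ in range(n_boot):                      # independent subsamples;
            idx = rng.integers(0, N, size=m)         # in practice these would
            vals.append((N / m) * log_lik(theta, data[idx]))  # run in parallel
        return np.mean(vals)                         # Monte Carlo average

    theta = theta0
    cur = approx_log_lik(theta) + log_prior(theta)   # stale estimate reused;
    draws = []                                       # variants re-estimate it
    for _ in range(n_iter):                          # at every iteration
        prop = proposal_rng(theta, rng)              # symmetric proposal assumed
        new = approx_log_lik(prop) + log_prior(prop)
        if np.log(rng.uniform()) < new - cur:        # MH acceptance on log scale
            theta, cur = prop, new
        draws.append(theta)
    return np.array(draws)
```

Because each of the n_boot subsample evaluations is independent, they could be dispatched to separate workers; they are written serially here only for clarity.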
The fifth article, "Partitioning a Large Simulation as It Runs," by Kary Myers, Earl Lawrence, Michael Fugate, Claire McKay Bowen, Lawrence Ticknor, Jon Woodring, Joanne Wendelberger, and Jim Ahrens, addresses the analysis of data streams, in which data are generated sequentially and data storage, transfer, and analysis are all challenging. The authors suggest an online, in situ method for identifying a reduced set of time steps of the data, together with the corresponding analysis results, to save in the storage facility, in order to significantly reduce data transfer and storage requirements.

The next two articles are about machine learning methods for handling big data. In the first, "High-Performance Kernel Machines With Implicit Distributed Optimization and Randomization," Haim Avron and Vikas Sindhwani propose a framework for massive-scale training of kernel-based statistical models, based on combining distributed convex optimization with randomization techniques. The second, "Statistical Learning of Neuronal Functional Connectivity," by Chunming Zhang, Yi Chai, Xiao Guo, Muhong Gao, David Devilbiss, and Zhengjun Zhang, identifies the network structure of a neuron ensemble beyond the standard measure of pairwise correlations, which is critical for understanding how information is transferred within such a neural population. Spike train data pose a significant challenge to conventional statistical methods, due not only to their complexity and massive size but also to their high dimensionality. In this article, the authors propose a novel "structural information enhanced" (SIE)