Large complex data: divide and recombine (D&R) with RHIPE

D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with no communication among them, making them embarrassingly parallel, the simplest form of parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data, whereas direct application of most analytic methods to the entire data is either infeasible or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment that has two widely used components, R and Hadoop, and our RHIPE merger of them. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows analysis wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd.
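The divide, apply, and recombine steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, Hadoop, or RHIPE, and it is not the authors' implementation. The function names (`divide`, `fit_line`, `recombine`, `divide_and_recombine`), the round-robin division, and the choice of averaging per-subset coefficients as the recombination are all assumptions made for illustration; the abstract does not specify which division or recombination procedures are used.

```python
from statistics import mean


def divide(rows, n_subsets):
    """Split rows into n_subsets by round-robin (an assumed, illustrative division)."""
    return [rows[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Apply an analytic method to one subset: least-squares fit of y = a + b*x."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in subset)
    slope = sxy / sxx
    return (ybar - slope * xbar, slope)


def recombine(fits):
    """Recombine per-subset outputs by averaging coefficients (an assumed choice)."""
    intercepts = [a for a, _ in fits]
    slopes = [b for _, b in fits]
    return (mean(intercepts), mean(slopes))


def divide_and_recombine(rows, n_subsets):
    subsets = divide(rows, n_subsets)
    # Each fit uses only its own subset, so these calls need no communication
    # and could be dispatched to separate workers.
    fits = [fit_line(s) for s in subsets]
    return recombine(fits)


# Noise-free synthetic data on the line y = 2 + 3x, for illustration only.
data = [(float(x), 2.0 + 3.0 * x) for x in range(100)]
intercept, slope = divide_and_recombine(data, 4)
```

Because no subset computation depends on another, the list comprehension in `divide_and_recombine` could be replaced by a parallel map without changing the result, which is what makes the computation embarrassingly parallel.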
