Large complex data: divide and recombine (D&R) with RHIPE

D&R is a new statistical approach to the analysis of large complex data. The data are divided into subsets. Computationally, each subset is a small dataset. Analytic methods are applied to each of the subsets, and the outputs of each method are recombined to form a result for the entire data. Computations can be run in parallel with no communication among them, making them embarrassingly parallel, the simplest possible parallel processing. Using D&R, a data analyst can apply almost any statistical or visualization method to large complex data. Direct application of most analytic methods to the entire data is either infeasible or impractical. D&R enables deep analysis: comprehensive analysis, including visualization of the detailed data, that minimizes the risk of losing important information. One of our D&R research thrusts uses statistics to develop “best” division and recombination procedures for analytic methods. Another is a D&R computational environment that has two widely used components, R and Hadoop, and our RHIPE merger of them. Hadoop is a distributed database and parallel compute engine that executes the embarrassingly parallel D&R computations across a cluster. RHIPE allows analysis wholly from within R, making programming with the data very efficient. Copyright © 2012 John Wiley & Sons, Ltd.
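
The divide, apply, and recombine steps described above can be illustrated with a minimal sketch. This is not the authors' implementation and does not use RHIPE or Hadoop; it is plain Python on a single machine. It assumes a random division of the rows, a simple least-squares line as the analytic method, and averaging of the per-subset estimates as the recombination. All function names (`divide`, `fit_line`, `recombine`, `divide_and_recombine`) and the synthetic data are hypothetical, chosen only for illustration.

```python
import random
from statistics import fmean


def divide(rows, n_subsets, seed=0):
    """Division step (assumed here: random division of the rows into subsets)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    # Deal the shuffled rows round-robin so subsets have near-equal size.
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def fit_line(subset):
    """Analytic method applied to one subset: least-squares slope and intercept."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in subset)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def recombine(fits):
    """Recombination step (assumed here: average the per-subset estimates)."""
    slopes, intercepts = zip(*fits)
    return fmean(slopes), fmean(intercepts)


def divide_and_recombine(rows, n_subsets):
    subsets = divide(rows, n_subsets)
    # Each fit needs no information from any other subset, so this map
    # could be swapped for a parallel map without changing the result.
    fits = list(map(fit_line, subsets))
    return recombine(fits)


# Synthetic data for illustration only: y = 2x + 1 plus small Gaussian noise.
noise = random.Random(42)
data = [(i / 100, 2.0 * (i / 100) + 1.0 + noise.gauss(0, 0.1)) for i in range(1000)]

slope, intercept = divide_and_recombine(data, n_subsets=10)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

Averaging is only one possible recombination. The abstract notes that choosing good division and recombination procedures for a given analytic method is itself a research question, so the choices in this sketch should be read as placeholders.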

Bowei Xi | William S. Cleveland | Ryan P. Hafen | Jin Xia | Jeremiah Rounds | Jianfu Li | Saptarshi Guha
