Box–Cox Transformation in Big Data

ABSTRACT The Box–Cox transformation is an important technique in linear regression when the assumptions of a regression model are seriously violated. The technique has been widely accepted and extensively applied since it was first proposed. Previous methods and algorithms for the Box–Cox transformation, which are based on the maximum likelihood approach, were mostly developed for small or moderate-sized data. They cannot be applied to big data because of memory and storage capacity barriers. To overcome these difficulties, the present article proposes new methods and algorithms whose basic idea is to construct and compute a set of summary statistics, termed the Box–Cox information array in the article. By a property of the maximum likelihood approach, computing the Box–Cox information array is the only task that requires reading the data. Once the Box–Cox information array is obtained, the optimal power transformation and the corresponding estimates of the model parameters can be computed quickly. Because the whole dataset is scanned only once, the proposed methods and algorithms can be extremely efficient and fast even when multiple models are considered. It is expected that the basic knowledge gained in this article will have a great impact on the development of statistical methods and algorithms for big data.
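
The single-scan idea described in the abstract can be illustrated with a minimal Python sketch. It is not the article's algorithm: the abstract does not define the contents of the Box–Cox information array, so the sketch assumes a fixed grid of candidate powers and accumulates, chunk by chunk, the cross-products X'X, X'z and z'z (with z the transformed response) together with the sum of log y. The standard Box–Cox profile log-likelihood, -(n/2) log(RSS/n) + (λ - 1) Σ log y, is then evaluated from those summaries alone. The names `boxcox`, `scan` and `fit`, the grid, and the simulated data are all hypothetical.

```python
import numpy as np

def boxcox(y, lam):
    """Box-Cox power transform of a strictly positive response y."""
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y**lam - 1.0) / lam

def scan(chunks, lambdas):
    """Read (X, y) chunks once and accumulate summary statistics.

    Assumed layout (not taken from the article): X'X, plus X'z and z'z
    for every candidate lambda, plus n and sum(log y).
    """
    n, sum_log_y = 0, 0.0
    xtx = xtz = ztz = None
    for X, y in chunks:
        if xtx is None:
            p = X.shape[1]
            xtx = np.zeros((p, p))
            xtz = np.zeros((len(lambdas), p))
            ztz = np.zeros(len(lambdas))
        n += y.size
        sum_log_y += np.log(y).sum()
        xtx += X.T @ X
        for k, lam in enumerate(lambdas):
            z = boxcox(y, lam)
            xtz[k] += X.T @ z
            ztz[k] += z @ z
    return n, sum_log_y, xtx, xtz, ztz

def fit(stats, lambdas):
    """Maximise the profile log-likelihood over the grid, using only the summaries."""
    n, sum_log_y, xtx, xtz, ztz = stats
    lambdas = np.asarray(lambdas, dtype=float)
    betas = np.linalg.solve(xtx, xtz.T).T          # least-squares fit per lambda
    rss = ztz - np.sum(xtz * betas, axis=1)        # z'z - z'X beta_hat
    loglik = -0.5 * n * np.log(rss / n) + (lambdas - 1.0) * sum_log_y
    best = int(np.argmax(loglik))
    return lambdas[best], betas[best], loglik

# Simulated data (hypothetical): linear on the lambda = 0.5 scale.
rng = np.random.default_rng(0)
n_rows = 5000
x = rng.uniform(0.0, 10.0, n_rows)
X = np.column_stack([np.ones(n_rows), x])
z_true = 5.0 + 2.0 * x + rng.normal(0.0, 0.5, n_rows)
y = (1.0 + 0.5 * z_true) ** 2                      # inverse Box-Cox, lambda = 0.5

lambdas = np.linspace(-1.0, 2.0, 31)
chunks = zip(np.array_split(X, 10), np.array_split(y, 10))  # stands in for reading blocks from disk
lam_hat, beta_hat, loglik = fit(scan(chunks, lambdas), lambdas)
print(lam_hat, beta_hat)
```

Because every accumulated quantity is a sum over rows, chunks could in principle be processed in any order or on separate machines and the partial results added together. Forming the residual sum of squares from cross-products can lose precision on ill-conditioned designs, so a production version would likely need a more careful numerical scheme.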
