Scalability and Parallelization of Sequential Processing: Big Data Demands and Information Algebras

Sequential procedures for updating information are important for processing “big data streams” because they avoid accumulating and storing large data sets. As a model of information accumulation, we study the Bayesian updating procedure for linear experiments. Analyzing and gradually transforming the original processing scheme to increase its efficiency leads to certain mathematical structures: information spaces. We show that processing can be simplified by introducing a special intermediate form of information representation. Thanks to the rich algebraic properties of the corresponding information space, this form unifies information updating and makes it more efficient. It also leads to various parallelization options for the inherently sequential Bayesian procedure, which are suited to distributed data processing platforms such as MapReduce. In addition, we show how a certain formalization of the concept of information and its algebraic properties can arise simply from adapting data processing to big data demands. The approaches and concepts developed in the paper make it possible to increase the efficiency and uniformity of data processing and present a systematic approach to transforming sequential processing into parallel processing.
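
The idea of an intermediate information form that can be combined in any grouping may be easier to see in a small sketch. The abstract does not spell out the representation used in the paper, so the code below assumes the standard information form of a linear Gaussian model: each scalar observation y = a·x + noise (unit noise variance assumed) contributes the pair (aᵀa, aᵀy), and pairs are merged by plain addition. The names `info`, `combine`, and `solve`, the prior, and the data are all hypothetical and chosen only for illustration.

```python
from functools import reduce

def info(a, y):
    """Information pair (T, v) for one observation y = a.x + noise.
    Assumption: unit noise variance, so T = a^T a and v = a^T y."""
    n = len(a)
    T = [[a[i] * a[j] for j in range(n)] for i in range(n)]
    v = [a[i] * y for i in range(n)]
    return T, v

def combine(p, q):
    """Merge two pieces of information by elementwise addition.
    Addition is associative and commutative, so grouping and order are free."""
    (T1, v1), (T2, v2) = p, q
    T = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(T1, T2)]
    v = [x + y for x, y in zip(v1, v2)]
    return T, v

def solve(T, v):
    """Solve T x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [list(map(float, row)) + [float(b)] for row, b in zip(T, v)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

# Hypothetical prior N(0, I) in information form: (inverse covariance, inverse covariance times mean).
prior = ([[1, 0], [0, 1]], [0, 0])

# Hypothetical noise-free observations generated from x = (2, 3).
data = [([1, 0], 2), ([0, 1], 3), ([1, 1], 5),
        ([1, -1], -1), ([2, 1], 7), ([1, 2], 8)]

# Sequential updating: fold observations into the prior one at a time.
sequential = reduce(combine, (info(a, y) for a, y in data), prior)

# MapReduce-style updating: summarize each chunk independently, then merge.
chunks = [data[0:2], data[2:4], data[4:6]]
partials = [reduce(combine, (info(a, y) for a, y in chunk)) for chunk in chunks]
parallel = reduce(combine, partials, prior)

# The posterior mean is extracted only once, at the very end.
estimate_seq = solve(*sequential)
estimate_par = solve(*parallel)
```

In this sketch the expensive step (solving the linear system) is deferred until an estimate is actually needed, while the accumulation step is a cheap addition that can be distributed across workers. Whether this matches the paper's exact construction is an assumption.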
