EVCA Classifier: A MCMC-Based Classifier for Analyzing High-Dimensional Big Data

We introduce a Markov Chain Monte Carlo (MCMC) classifier that combines Bayesian machine learning with Apache Spark, and we apply it to big data management and environmental analysis. Using a large dataset of air pollutant concentrations in Madrid from 2001 to 2018, we developed a Bayesian Logistic Regression model that classifies the Air Quality Index (AQI) as safe or hazardous. The formulation combines prior beliefs with observed data to produce posterior distributions over the model parameters, which helps control overfitting, improves predictive accuracy, and scales to large datasets. The proposed model achieved a maximum accuracy of 87.91% and a recall of 99.58% at a decision threshold of 0.505, indicating that it identifies nearly all positive instances and produces few false negatives, although it slightly underperformed traditional Frequentist Logistic Regression in accuracy and AUC score. Overall, this research demonstrates the efficacy of Bayesian machine learning for big data management and environmental analysis, and highlights the role of the first-ever MCMC Classifier and Apache Spark in handling large, high-dimensional datasets, with implications for fields such as statistics, mathematics, and physics as well as for practical, real-world applications.
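To make the idea concrete, the following is a minimal sketch of Bayesian logistic regression fitted with a random-walk Metropolis sampler, followed by classification at a decision threshold of 0.505. It is not the implementation used in this work: the synthetic data, the Gaussian prior and its scale (`prior_sd`), the proposal step size, the burn-in length, and all function names are assumptions chosen for illustration, and the sketch runs on a single machine with NumPy rather than on Apache Spark.

```python
import numpy as np

def log_posterior(w, X, y, prior_sd=10.0):
    """Unnormalised log posterior: Bernoulli likelihood times Gaussian prior."""
    z = X @ w
    # Logistic log-likelihood, written with logaddexp for numerical stability.
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    # Assumed prior: independent zero-mean Gaussians on each coefficient.
    log_prior = -0.5 * np.sum(w ** 2) / prior_sd ** 2
    return log_lik + log_prior

def metropolis(X, y, n_samples=5000, step=0.1, seed=0):
    """Random-walk Metropolis sampler over the regression coefficients."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    current = log_posterior(w, X, y)
    samples = np.empty((n_samples, X.shape[1]))
    for i in range(n_samples):
        proposal = w + step * rng.normal(size=w.shape)
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio); the proposal is symmetric.
        if np.log(rng.uniform()) < candidate - current:
            w, current = proposal, candidate
        samples[i] = w
    return samples

def predict_proba(samples, X):
    """Posterior predictive probability: sigmoid averaged over the draws."""
    return (1.0 / (1.0 + np.exp(-(X @ samples.T)))).mean(axis=1)

def accuracy_recall(y_true, proba, threshold=0.505):
    """Accuracy and recall of the positive class at a given threshold."""
    y_pred = (proba >= threshold).astype(int)
    acc = np.mean(y_pred == y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return acc, tp / (tp + fn)

# Synthetic stand-in data (hypothetical; not the Madrid air-quality dataset).
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_w = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(int)

samples = metropolis(X, y)
posterior = samples[2000:]  # discard an assumed burn-in of 2000 draws
acc, recall = accuracy_recall(y, predict_proba(posterior, X))
print(f"accuracy={acc:.3f} recall={recall:.3f}")
```

Moving the threshold trades recall against accuracy: a lower threshold labels more instances as positive, which raises recall and typically increases false positives. A production sampler would more likely use an adaptive method such as the No-U-Turn sampler than plain random-walk Metropolis.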
