A Comparison of Clustering and Prediction Methods for Identifying Key Chemical–Biological Features Affecting Bioreactor Performance

Chemical–biological systems, such as bioreactors, contain stochastic and non-linear interactions that are difficult to characterize. The highly complex interactions between microbial species and communities may not be sufficiently captured by first-principles, stationary, or low-dimensional models. This paper compares multiple data analysis strategies: three predictive models (random forests, support vector machines, and neural networks), three clustering models (hierarchical, Gaussian mixtures, and Dirichlet mixtures), and two feature selection approaches (mean decrease in accuracy and its conditional variant). These methods not only predict the bioreactor outcome with sufficient accuracy but also identify the important features correlated with that outcome. The novelty of this work lies in the extensive exploration and critique of a wide range of methods, rather than the single method examined in many similar papers. The results show that random forest models predict the test set outcomes with the highest accuracy. The identified contributory features include process features that agree with domain knowledge, as well as several biomarker operational taxonomic units (OTUs). The results reinforce the notion that both chemical and biological features significantly affect bioreactor performance. However, they also indicate that the quality of the biological features could be improved by considering non-clustering methods, which may better represent the true behaviour within the OTU communities.
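As a rough illustration of the mean decrease in accuracy idea named above, the sketch below permutes one feature at a time on held-out data and records how much the accuracy drops. It is a minimal sketch under stated assumptions, not the paper's procedure: the data are synthetic, a simple nearest-centroid classifier stands in for the random forest, and all function names (`make_data`, `fit_centroids`, `mean_decrease_accuracy`) are hypothetical.

```python
import random


def make_data(n, rng):
    """Synthetic stand-in data: feature 0 drives the label, feature 1 is noise."""
    rows, labels = [], []
    for i in range(n):
        y = i % 2  # alternate classes so both are equally represented
        rows.append([2.0 * y + rng.gauss(0.0, 0.5), rng.gauss(0.0, 1.0)])
        labels.append(y)
    return rows, labels


def fit_centroids(rows, labels):
    """Nearest-centroid classifier (an assumption, not a random forest):
    store the per-class mean of every feature."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda cls: sq_dist(centroids[cls]))


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return hits / len(labels)


def mean_decrease_accuracy(centroids, rows, labels, rng, repeats=20):
    """Average drop in accuracy after shuffling each feature column in turn."""
    baseline = accuracy(centroids, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(centroids, permuted, labels))
        importances.append(sum(drops) / repeats)
    return baseline, importances


rng = random.Random(0)  # fixed seed so the run is repeatable
rows, labels = make_data(200, rng)
train_x, train_y = rows[:140], labels[:140]
test_x, test_y = rows[140:], labels[140:]

model = fit_centroids(train_x, train_y)
baseline, importances = mean_decrease_accuracy(model, test_x, test_y, rng)
print("held-out accuracy:", baseline)
print("mean decrease in accuracy per feature:", importances)
```

A feature whose permutation leaves the accuracy nearly unchanged contributes little to the prediction, while a large drop marks a feature the model relies on. The conditional variant mentioned in the abstract differs in that it permutes a feature within groups defined by correlated features; that refinement is not shown here.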
