Data preprocessing for multiblock modelling – A systematization with new methods

Abstract: With the advance of Industry 4.0, new data collectors are appearing at different points of the process, generating blocks of data whose integrity should be preserved during data analysis. This is the scope of multiblock methods, whose potential has been recognized in several application areas where they are becoming increasingly popular. Multiblock methods can be applied to a wide range of data-driven problems that practitioners face today, such as plant-wide process monitoring and diagnosis, process optimization, and quality prediction of key product properties. These methods can find associations and interpretative connections between data blocks that come from different sources and carry complementary or overlapping information, and they can assess the blocks' relative contributions to the final outcome. A critical stage in the application of multiblock methods is selecting the appropriate preprocessing for each block before proceeding to the modelling. A well-chosen preprocessing strategy can enhance the information extracted from the blocks and their mutual interactions; an inappropriate one can hide, mask, or distort it. In this article, we present a systematic workflow in which both the intra-block and inter-block variation components are considered during preprocessing. We illustrate the application of the framework with two real case studies, in which the different preprocessing alternatives are critically compared.
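To make the distinction between intra-block and inter-block preprocessing concrete, the following is a minimal sketch and not the workflow proposed in the article. It assumes two hypothetical blocks (a small block of process variables and a wide spectral block, both simulated), uses autoscaling as the intra-block step, and uses division by the square root of the number of columns as the inter-block step. The function names `autoscale` and `block_scale` are illustrative only.

```python
import numpy as np

def autoscale(block):
    """Intra-block step: mean-centre each column and scale it to unit variance."""
    centred = block - block.mean(axis=0)
    std = block.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    return centred / std

def block_scale(block):
    """Inter-block step: divide by sqrt(number of columns) so each block has equal total variance."""
    return block / np.sqrt(block.shape[1])

# Simulated data (assumption): 20 samples, 3 process variables, 200 spectral channels.
rng = np.random.default_rng(0)
process = rng.normal(50.0, 5.0, size=(20, 3))
spectra = rng.normal(0.5, 0.1, size=(20, 200))

blocks = [block_scale(autoscale(b)) for b in (process, spectra)]
X = np.hstack(blocks)  # concatenated matrix ready for a multiblock or PCA/PLS model

# Total variance per block after both steps; equal values mean neither block dominates by size.
block_variances = [float(b.var(axis=0, ddof=1).sum()) for b in blocks]
```

Without the inter-block step, the 200-column spectral block would carry far more total variance than the 3-column process block after autoscaling, and would tend to dominate any model fitted to the concatenated matrix.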
