Pre-treatment of outliers and anomalies in plant data: Methodology and case study of a Vacuum Distillation Unit

Data pre-treatment plays a significant role in improving data quality, allowing accurate information to be extracted from raw data. One commonly used pre-treatment technique is outlier detection. The so-called 3σ method, which uses three standard deviations from the mean as upper and lower limits, is common practice for identifying outliers. However, as shown in the manuscript, it does not identify all outliers, which can distort the overall statistics of the data. This problem can have a significant impact on further data analysis and can reduce the accuracy of predictive models. Many techniques for outlier detection exist; however, beyond their theoretical development, they all need to be validated in case studies. In this work, two types of outliers were considered: short-term outliers (erroneous data, noise) and long-term outliers (mainly malfunctions lasting for longer periods of time). The data were taken from the vacuum distillation unit (VDU) of an Asian refinery and included data points from 40 physical sensors (temperature, pressure and flow rate). We used a modified 3σ threshold method to identify the short-term outliers. More specifically, sensor data are divided into chunks delimited by change points, and 3σ thresholds are calculated within each chunk, where the data are approximately normally distributed. We have shown that the piecewise 3σ method offers a
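The piecewise 3σ idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `three_sigma_mask` and `piecewise_three_sigma_mask`, the synthetic two-level signal, and the assumption that the change point (index 200) is already known are all hypothetical choices made for illustration. In practice the change points would come from a change-point detection step, which is not shown here.

```python
import numpy as np


def three_sigma_mask(values):
    """Flag points farther than 3 standard deviations from the mean."""
    mean = values.mean()
    std = values.std()
    return np.abs(values - mean) > 3.0 * std


def piecewise_three_sigma_mask(values, change_points):
    """Apply the 3-sigma rule separately within each chunk.

    `change_points` is an assumed, pre-computed list of indices at which
    the signal changes level; each index starts a new chunk.
    """
    mask = np.zeros(values.shape[0], dtype=bool)
    bounds = [0, *sorted(change_points), values.shape[0]]
    for start, stop in zip(bounds[:-1], bounds[1:]):
        if stop > start:  # skip empty chunks
            mask[start:stop] = three_sigma_mask(values[start:stop])
    return mask


# Synthetic sensor signal (illustrative only): two operating levels
# with a small deterministic ripple and one erroneous reading.
t = np.arange(400)
signal = np.where(t < 200, 10.0, 20.0) + 0.1 * np.sin(t)
signal[50] = 12.0  # short-term outlier inside the first chunk

global_flags = three_sigma_mask(signal)
piecewise_flags = piecewise_three_sigma_mask(signal, change_points=[200])

print("global 3-sigma flags:   ", np.flatnonzero(global_flags))
print("piecewise 3-sigma flags:", np.flatnonzero(piecewise_flags))
```

On this synthetic signal the level shift inflates the global standard deviation, so the global limits are too wide to catch the reading at index 50, while the limits computed within the first chunk do flag it.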
