A method for outlier detection based on cluster analysis and visual expert criteria

Outlier detection is an important problem that arises in a wide range of areas. Outliers may result from fraudulent behaviour, mechanical faults, human error, or simply natural deviations. Many data mining applications perform outlier detection, often as a preliminary step to filter out outliers and build more representative models. In this paper, we propose an outlier detection method based on a clustering process. The proposal aims to overcome the specificity of many existing outlier detection techniques, which fail to take into account the inherent dispersion of domain objects. The method relies on four criteria designed to represent how human beings (experts in each domain) visually identify outliers within a set of objects after analysing the clusters. This is an advantage over other clustering-based outlier detection techniques, which are founded on a purely numerical analysis of clusters. Our proposal has been evaluated, with satisfactory results, on data (particularly time series) from two different domains: stabilometry, a branch of medicine that studies balance-related functions in human beings, and electroencephalography (EEG), a neurological examination used to diagnose nervous system disorders. To validate the proposed method, we studied its outlier detection performance and its efficiency in terms of runtime. The results of regression analyses confirm that our proposal is useful for detecting outlier data in different domains, with a false positive rate of less than 2% and a reliability greater than 99%.
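
To make the general idea of clustering-based outlier detection concrete, the sketch below groups objects by proximity and flags the members of very small clusters. It is only a generic illustration and not the method proposed in the paper: the four visual expert criteria are not spelled out in this abstract and are not reproduced here. The function names, the `max_gap` and `min_size` parameters, and the toy data are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def threshold_clusters(points, max_gap):
    """Single-linkage grouping: two points share a cluster when a chain of
    neighbours, each at most max_gap apart, connects them."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in unvisited
                    if dist(points[current], points[j]) <= max_gap]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters


def small_cluster_outliers(points, max_gap, min_size):
    """Flag every object that lies in a cluster with fewer than min_size members
    (an assumed, purely numerical rule, not the paper's expert criteria)."""
    clusters = threshold_clusters(points, max_gap)
    return sorted(i for c in clusters if len(c) < min_size for i in c)


def false_positive_rate(flagged, true_outliers, n_objects):
    """Share of genuinely normal objects that were flagged as outliers."""
    flagged, true_outliers = set(flagged), set(true_outliers)
    normal = n_objects - len(true_outliers)
    return len(flagged - true_outliers) / normal if normal else 0.0


# Hypothetical toy data: two compact groups and one isolated object (index 8).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 30)]
flagged = small_cluster_outliers(points, max_gap=1.5, min_size=2)
print(flagged)
print(false_positive_rate(flagged, true_outliers=[8], n_objects=len(points)))
```

A rule of this kind depends entirely on numerical thresholds, which is the sort of purely numerical cluster analysis the abstract contrasts with criteria modelled on expert visual judgement.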
