A Comparison of Unsupervised Abnormality Detection Methods for Interstitial Lung Disease

Abnormality detection, also known as outlier or novelty detection, seeks to identify data that do not match an expected distribution. In medical imaging, it can be used to flag samples with possible pathology or, more generally, to exclude samples that are normal. This may be done by learning a model of normality against which new samples are evaluated. In this paper, four methods, each representing a different family of techniques, are compared: the one-class support vector machine, isolation forest, local outlier factor, and fast minimum covariance determinant estimator. Each method is evaluated on patches of CT scans of interstitial lung disease, with the patches encoded by one of four embedding methods: principal component analysis, kernel principal component analysis, a flat autoencoder, and a convolutional autoencoder. The data consist of 5500 healthy patches from one patient cohort, defining normality, and 2970 patches from a second patient cohort exhibiting emphysema, fibrosis, ground-glass opacity, and micronodule pathology, representing abnormality; 1030 healthy patches from this second cohort serve as the evaluation dataset. Methods are evaluated on both accuracy (area under the ROC curve) and runtime efficiency. The fast minimum covariance determinant estimator scales only moderately well with dataset dimensionality, while the isolation forest and one-class support vector machine scale well. The one-class support vector machine is the most accurate, closely followed by the isolation forest and the fast minimum covariance determinant estimator. The embeddings from kernel principal component analysis are the most generally useful.
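The evaluation protocol described above can be sketched with scikit-learn, which provides all four detector families and the PCA embedding. The synthetic "patch" data, embedding dimension, and hyperparameters below are illustrative assumptions, not the paper's actual dataset or settings: the detector is fit on normal samples only, then a held-out mix of normal and abnormal samples is scored and ranked by AUC.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope  # fast-MCD under the hood
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for the CT patches: "healthy" training samples and a
# mixed evaluation set (shapes and values are illustrative only).
X_train = rng.normal(0.0, 1.0, size=(500, 64))               # healthy patches
X_eval = np.vstack([rng.normal(0.0, 1.0, size=(100, 64)),    # healthy
                    rng.normal(3.0, 1.0, size=(100, 64))])   # abnormal
y_eval = np.r_[np.zeros(100), np.ones(100)]                  # 1 = abnormal

# Embed the patches with PCA (one of the four embeddings compared).
pca = PCA(n_components=16).fit(X_train)
Z_train, Z_eval = pca.transform(X_train), pca.transform(X_eval)

detectors = {
    "one-class SVM": OneClassSVM(nu=0.1, gamma="scale"),
    "isolation forest": IsolationForest(random_state=0),
    "local outlier factor": LocalOutlierFactor(novelty=True),
    "fast-MCD": EllipticEnvelope(random_state=0),
}
aucs = {}
for name, det in detectors.items():
    det.fit(Z_train)                     # learn normality from healthy data
    # score_samples is higher for more "normal" samples, so negate it
    # to obtain an abnormality score for the ROC analysis.
    aucs[name] = roc_auc_score(y_eval, -det.score_samples(Z_eval))
```

Fitting on healthy data alone and scoring a mixed evaluation set mirrors the paper's one-class setup; swapping `PCA` for `KernelPCA` or an autoencoder embedding would reproduce the other arms of the comparison.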
