Comparative evaluation of autocontouring in clinical practice: A practical method using the Turing test

PURPOSE Automated techniques for estimating the contours of organs and structures in medical images have become more widespread, and a variety of measures are available for assessing their quality. Quantitative measures of geometric agreement, for example overlap with a gold-standard delineation, are popular but may not predict the level of clinical acceptance of the contouring method. Therefore, surrogate measures that relate more directly to the clinical judgment of contours, and to the way they are used in routine workflows, need to be developed. The purpose of this study is to propose a method (inspired by the Turing Test) for providing contour quality measures that draw directly upon practitioners' assessments of manual and automatic contours. This approach assumes that an inability to distinguish automatically produced contours from those of clinical experts would indicate that the contours are of sufficient quality for clinical use. In turn, it is anticipated that such contours would receive less manual editing prior to being accepted for clinical use. In this study, an initial assessment of this approach is performed with radiation oncologists and therapists.

METHODS Eight clinical observers were presented with thoracic organ-at-risk contours through a web interface and were asked to determine whether they were automatically generated or manually delineated. The accuracy of the visual determination was assessed, and the proportion of contours for which the source was misclassified was recorded. Contours of six different organs from a clinical workflow were used for 20 patient cases. The time required to edit autocontours to a clinically acceptable standard was also measured, as a gold standard of clinical utility. Established quantitative measures of autocontouring performance, such as the Dice similarity coefficient with respect to the original clinical contour, and the misclassification rate assessed with the proposed framework were evaluated as surrogates of the measured editing time.

RESULTS The misclassification rates for each organ were: esophagus 30.0%, heart 22.9%, left lung 51.2%, right lung 58.5%, mediastinum envelope 43.9%, and spinal cord 46.8%. The time savings resulting from editing the autocontours compared to the standard clinical workflow were 12%, 25%, 43%, 77%, 46%, and 50%, respectively, for these organs. The median Dice similarity coefficients between the clinical contours and the autocontours were 0.46, 0.90, 0.98, 0.98, 0.94, and 0.86, respectively, for these organs.

CONCLUSIONS The misclassification rate corresponded better with time saving than the quantitative contour measures explored. From this, we conclude that the inability to accurately judge the source of a contour indicates a reduced need for editing and therefore a greater time saving overall. Hence, task-based assessments of contouring performance may be considered as an additional way of evaluating the clinical utility of autosegmentation methods.
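The two surrogate measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that a contour is represented as a set of voxel coordinates and that each observer response is a simple "auto" or "manual" label; the function names `dice_coefficient` and `misclassification_rate` and the toy data are hypothetical.

```python
from typing import Sequence, Set, Tuple

Voxel = Tuple[int, int, int]  # assumed representation: (x, y, z) voxel index


def dice_coefficient(a: Set[Voxel], b: Set[Voxel]) -> float:
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        # Convention assumed here: two empty contours agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def misclassification_rate(true_sources: Sequence[str],
                           guesses: Sequence[str]) -> float:
    """Fraction of contours whose source the observer judged incorrectly."""
    if len(true_sources) != len(guesses):
        raise ValueError("true_sources and guesses must have the same length")
    if not true_sources:
        raise ValueError("at least one judgment is required")
    wrong = sum(t != g for t, g in zip(true_sources, guesses))
    return wrong / len(true_sources)


# Toy data (hypothetical, not taken from the study).
manual_contour = {(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)}
auto_contour = {(0, 1, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0)}
print(dice_coefficient(manual_contour, auto_contour))  # 0.75

true_sources = ["auto", "manual", "auto", "manual"]
observer_guesses = ["manual", "manual", "auto", "auto"]
print(misclassification_rate(true_sources, observer_guesses))  # 0.5
```

In this framing, a misclassification rate near 50% would mean that observers cannot tell automatic contours from manual ones better than chance, whereas a Dice coefficient near 1 only indicates close geometric overlap with the reference contour.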
