BAGLS, a multihospital Benchmark for Automatic Glottis Segmentation

Laryngeal videoendoscopy is one of the main tools for the clinical examination of voice disorders and for voice research. High-speed videoendoscopy makes it possible to fully capture the vocal fold oscillations; however, processing the recordings typically involves a time-consuming segmentation of the glottal area by trained experts. Although automatic methods have been proposed and the task is particularly well suited to deep learning, no public datasets or benchmarks are available to compare methods or to train deep learning models that generalize. In an international collaboration of researchers from seven institutions in the EU and the USA, we have created BAGLS, a large, multihospital dataset of 59,250 high-speed videoendoscopy frames, each with an individually annotated segmentation mask. The frames are taken from 640 recordings of healthy and disordered subjects, recorded by numerous clinicians with varying technical equipment. BAGLS will allow an objective comparison of glottis segmentation methods and will enable interested researchers to train and benchmark their own models.

Measurement(s): glottis • image segmentation
Technology Type(s): endoscopic procedure • neural network model
Factor Type(s): age • sex • healthy versus disordered subjects • recording conditions
Sample Characteristic - Organism: Homo sapiens

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12387890
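A benchmark compares each method's predicted glottis mask with the expert-annotated mask for the same frame. The abstract does not name the comparison score. The sketch below assumes two common overlap metrics for binary segmentation: intersection over union (IoU, the Jaccard index) and the Dice coefficient. Masks are represented as nested lists of 0/1 pixels. The function names `iou` and `dice` are illustrative only and are not part of any BAGLS tooling.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a 2-D binary mask, 1 = glottis, 0 = background.
Mask = Sequence[Sequence[int]]


def _pixel_pairs(pred: Mask, truth: Mask) -> List[Tuple[bool, bool]]:
    """Pair up corresponding pixels of two masks, checking that shapes match."""
    if len(pred) != len(truth) or any(len(rp) != len(rt) for rp, rt in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    return [(bool(a), bool(b)) for rp, rt in zip(pred, truth) for a, b in zip(rp, rt)]


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union |A ∩ B| / |A ∪ B| of two binary masks."""
    pairs = _pixel_pairs(pred, truth)
    inter = sum(a and b for a, b in pairs)
    union = sum(a or b for a, b in pairs)
    # Assumption: two empty masks (e.g. a frame with a fully closed glottis)
    # count as perfect agreement.
    return 1.0 if union == 0 else inter / union


def dice(pred: Mask, truth: Mask) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    pairs = _pixel_pairs(pred, truth)
    inter = sum(a and b for a, b in pairs)
    total = sum(a for a, _ in pairs) + sum(b for _, b in pairs)
    # Same empty-mask convention as iou() above (an assumption).
    return 1.0 if total == 0 else 2 * inter / total


if __name__ == "__main__":
    predicted = [[0, 1, 1],
                 [0, 1, 0]]
    annotated = [[0, 1, 0],
                 [0, 1, 1]]
    print(iou(predicted, annotated))   # 2 shared pixels / 4 in the union = 0.5
    print(dice(predicted, annotated))  # 2 * 2 / (3 + 3) ≈ 0.667
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank methods identically on a single frame. They can differ when averaged over many frames. A per-frame score averaged over a test set is one plausible way to summarize a method, but this is an assumption rather than the protocol stated here.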