Data Evaluation and Enhancement for Quality Improvement of Machine Learning

Poor data quality directly degrades the performance of any machine learning system built on that data. Transfer learning, a demonstrated approach to data quality improvement, has been widely used to improve machine learning quality. However, the “quality improvement” attributed to transfer learning has rarely been rigorously validated, and some of the reported improvements are misleading. This article first exposes a hidden quality problem in the datasets used to build a machine learning system for normalizing medical concepts in social media text. That system was claimed to achieve the best performance among existing work on the task. However, our experiments show that the “best performance” resulted from the poor quality of the datasets and a defective validation process. To address the data quality issue and build a high-performance medical concept normalization system, we developed a transfer-learning-based strategy for data quality enhancement and system performance improvement. The experimental results show a strong correlation between the quality of the datasets and the performance of the machine learning system. They also demonstrate that a rigorous evaluation of data quality is necessary to guide the quality improvement of machine learning. We therefore propose a data quality evaluation framework that comprises quality criteria and their corresponding evaluation approaches. Machine learning researchers and practitioners can use the data validation process, the performance improvement strategy, and the data quality evaluation framework discussed in this article to build high-performance machine learning systems. The code and datasets used in this research are available on GitHub (https://github.com/haihua0913/dataEvaluationML).
