Data-centric Artificial Intelligence: A Survey

Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler of its great success is the availability of abundant, high-quality data for building machine learning models. Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI. The attention of researchers and practitioners has gradually shifted from advancing model design to enhancing the quality and quantity of the data. In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals (training data development, inference data development, and data maintenance) and their representative methods. We also organize the existing literature from automation and collaboration perspectives, discuss the challenges, and tabulate the benchmarks for various tasks. We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle. We hope it helps readers efficiently grasp a broad picture of this field and equips them with the techniques and further research ideas to systematically engineer data for building AI systems. A companion list of data-centric AI resources will be regularly updated at https://github.com/daochenzha/data-centric-AI.
