Towards Personalized Preprocessing Pipeline Search

Feature preprocessing, which transforms raw input features into numerical representations, is a crucial step in automated machine learning (AutoML) systems. However, existing systems often have a very small search space for feature preprocessing and apply the same preprocessing pipeline to all numerical features. This may result in sub-optimal performance, since datasets differ in their feature characteristics and the features within a single dataset may each favor different preprocessing. To bridge this gap, we explore personalized preprocessing pipeline search, where the search algorithm is allowed to adopt a different preprocessing pipeline for each feature. This is a challenging task because the search space grows exponentially with the number of features. To tackle this challenge, we propose ClusterP3S, a novel framework for Personalized Preprocessing Pipeline Search via Clustering. The key idea is to learn feature clusters so that the search space can be significantly reduced by using the same preprocessing pipeline for all features within a cluster. To this end, we propose a hierarchical search strategy that jointly learns the clusters and searches for the optimal pipelines: the upper-level search optimizes the feature clustering so that better pipelines can be built upon the clusters, and the lower-level search optimizes the pipelines given a specific cluster assignment. We instantiate this idea with a deep clustering network trained with reinforcement learning at the upper level and random search at the lower level. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
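To make the search-space argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical set of four candidate pipelines (`PIPELINES`), a fixed cluster assignment supplied by the caller, and a user-provided `evaluate` function; all of these names are illustrative. It covers only the lower-level random search, and the upper-level deep clustering network trained with reinforcement learning is not reproduced here. With p candidate pipelines and d features, per-feature search has p^d configurations, whereas sharing one pipeline per cluster with k clusters leaves p^k.

```python
import random

# Hypothetical candidate pipelines; the actual search space is not specified here.
PIPELINES = ["identity", "standardize", "minmax", "log"]


def search_space_size(num_pipelines, num_groups):
    """Number of ways to assign one pipeline to each group (feature or cluster)."""
    return num_pipelines ** num_groups


def random_search(cluster_of_feature, num_clusters, evaluate, trials, rng):
    """Lower-level search sketch: sample one pipeline per cluster, keep the best.

    cluster_of_feature[i] is the cluster index of feature i (assumed given).
    evaluate maps a per-feature list of pipeline names to a score
    (higher is better); it stands in for training and validating a model.
    """
    best_score, best_assignment = float("-inf"), None
    for _ in range(trials):
        # One pipeline per cluster, shared by every feature in that cluster.
        per_cluster = [rng.choice(PIPELINES) for _ in range(num_clusters)]
        per_feature = [per_cluster[c] for c in cluster_of_feature]
        score = evaluate(per_feature)
        if score > best_score:
            best_score, best_assignment = score, per_feature
    return best_score, best_assignment


if __name__ == "__main__":
    # 20 features searched individually versus grouped into 3 clusters.
    print(search_space_size(len(PIPELINES), 20))
    print(search_space_size(len(PIPELINES), 3))

    # Toy scoring function (an assumption): rewards the "log" pipeline.
    def toy_score(per_feature):
        return sum(name == "log" for name in per_feature)

    clusters = [0, 0, 1, 2, 1]
    print(random_search(clusters, 3, toy_score, trials=50, rng=random.Random(0)))
```

In a hierarchical setup like the one described above, an upper-level procedure would propose different values of `cluster_of_feature` and use the best score returned by the lower-level search as its feedback signal.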
