A Human-in-the-loop Attribute Design Framework for Classification

In this paper, we present a semi-automated, “human-in-the-loop” framework for attribute design that helps human analysts transform raw attributes into effective derived attributes for classification problems. The proposed framework is optimization-guided and fully agnostic to the underlying classification model. We present an algebra with arithmetic, relational, and logical operators for transforming raw attributes into derived attributes, and we solve two technical problems: (a) the top-k buckets design problem, which presents human analysts with k buckets, each containing promising raw attributes, so that they can focus on those buckets without having to examine all raw attributes; and (b) the top-l snippets generation problem, which iteratively aids human analysts with the top-l derived attributes involving a given attribute. For the former problem, we present an effective exact bottom-up algorithm with pruning, as well as random-walk-based heuristic algorithms that are intuitive and work well in practice. For the latter, we present a greedy heuristic algorithm that is scalable and effective. Rigorous evaluations on six real-world datasets show that our framework generates effective derived attributes relative to fully manual or fully automated methods.
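To make the idea of derived attributes concrete, the following is a minimal Python sketch, not the algorithm from the paper. It applies a small, illustrative set of arithmetic, relational, and logical operators to pairs of raw attributes and ranks the resulting derived attributes for one chosen attribute. The operator set, the function names (`mutual_information`, `top_l_derived`), the toy data, and the use of empirical mutual information as the score are all assumptions made for illustration; the abstract does not state the scoring objective, and the exhaustive ranking here stands in for the greedy heuristic it mentions.

```python
import math
import operator
from collections import Counter

# Hypothetical operator algebra: a few examples per family named in the abstract.
OPERATORS = {
    "+": operator.add,                        # arithmetic
    "-": operator.sub,                        # arithmetic
    "*": operator.mul,                        # arithmetic
    ">": operator.gt,                         # relational
    "and": lambda a, b: bool(a) and bool(b),  # logical
}


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_l_derived(columns, labels, anchor, l):
    """Score every 'anchor op other' attribute and return the l best as (score, name).

    Assumption: the score is mutual information with the class label.
    """
    scored = []
    for other in columns:
        if other == anchor:
            continue
        for symbol, fn in OPERATORS.items():
            values = [fn(a, b) for a, b in zip(columns[anchor], columns[other])]
            score = mutual_information(values, labels)
            scored.append((score, f"{anchor} {symbol} {other}"))
    # Highest score first; ties broken by name so the output is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:l]


if __name__ == "__main__":
    # Toy data (made up): the label is 1 exactly when x > y.
    columns = {
        "x": [1, 5, 3, 8, 2, 7],
        "y": [4, 2, 6, 1, 9, 3],
        "z": [0, 1, 0, 1, 1, 0],
    }
    labels = [0, 1, 0, 1, 0, 1]
    for score, name in top_l_derived(columns, labels, anchor="x", l=3):
        print(f"{name}: {score:.3f} bits")
```

In this sketch the analyst would pick `anchor` and inspect the returned snippets before deciding which derived attribute to keep. Mutual information tends to favor attributes with many distinct values, so a practical score would likely need a correction or a held-out evaluation.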
