Asterisk

Labeling datasets is one of the most expensive bottlenecks in machine-learning data preprocessing. Organizations across many domains therefore apply weak supervision to produce labels cheaply; however, because weak supervision relies on inexpensive, noisy sources, the quality of the generated labels is often poor. In this article, we present Asterisk, an end-to-end framework for generating high-quality, large-scale labeled datasets. The system first automatically generates heuristics to assign initial labels. The framework then applies a novel data-driven active learning process to improve labeling quality: we present an algorithm that learns the selection policy from the modeled accuracies of the heuristics together with the output of the generative model. Finally, the system uses the outcome of the active learning process to refine the labels. To evaluate the proposed system, we compare its performance against four state-of-the-art techniques. In collaboration with our industrial partner, IBM, we test the framework on a wide range of real-world applications. The experiments cover 10 datasets of varying sizes, up to 11 million records. The results demonstrate the framework's effectiveness in producing high-quality labels and achieving high classification accuracy with minimal annotation effort.
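The pipeline described above can be illustrated with a minimal sketch: several weak heuristics (labeling functions) vote on each record, abstaining when they do not apply, and the records whose votes are most split are the ones selected for human annotation. This is a generic weak-supervision-plus-uncertainty-sampling illustration, not Asterisk's actual algorithm; all heuristics, field names, and the majority-vote combiner below are hypothetical stand-ins for the learned heuristics and generative model.

```python
# Hypothetical heuristics over a toy transaction-labeling task.
# Each returns 1 (positive), 0 (negative), or None (abstain).
def h_amount(record):
    return 1 if record["amount"] > 1000 else 0

def h_keyword(record):
    return 1 if "refund" in record["note"] else None

def h_history(record):
    return 0 if record["prior_defaults"] == 0 else 1

HEURISTICS = [h_amount, h_keyword, h_history]

def weak_label(record):
    """Majority vote over non-abstaining heuristics; returns (label, confidence)."""
    votes = [v for h in HEURISTICS if (v := h(record)) is not None]
    if not votes:
        return None, 0.0
    pos = sum(votes) / len(votes)
    label = 1 if pos >= 0.5 else 0
    return label, max(pos, 1 - pos)

def select_for_annotation(records, budget):
    """Uncertainty sampling: pick indices of the records with the most split vote."""
    scored = sorted((weak_label(r)[1], i) for i, r in enumerate(records))
    return [i for _, i in scored[:budget]]

records = [
    {"amount": 1500, "note": "refund requested", "prior_defaults": 0},
    {"amount": 200,  "note": "ok",               "prior_defaults": 2},
    {"amount": 50,   "note": "ok",               "prior_defaults": 0},
]
print(select_for_annotation(records, 1))  # the record with a 50/50 vote split
```

In a full system such as Asterisk, the uniform majority vote would be replaced by the generative model's estimated heuristic accuracies, and the selection policy would be learned rather than fixed to plain uncertainty sampling.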
