Structured Labeling to Facilitate Concept Evolution in Machine Learning

Labeling data is a seemingly simple task required for training many machine learning systems, but it is actually fraught with problems. This paper introduces the notion of concept evolution: the changing nature of a person’s underlying concept (the abstract notion of the target class a person is labeling for, e.g., spam email or travel-related web pages), which can result in inconsistent labels and thus be detrimental to machine learning. We introduce two solutions based on structured labeling, a novel technique we propose for helping people define and refine their concept in a consistent manner as they label. Through a series of five experiments, including a controlled lab study, we illustrate the impact and dynamics of concept evolution in practice and show that structured labeling helps people label more consistently in the presence of concept evolution than traditional labeling does.
