You Are the Only Possible Oracle: Effective Test Selection for End Users of Interactive Machine Learning Systems

How do you test a program when only a single user, with no expertise in software testing, is able to determine whether the program is performing correctly? Such programs are common today in the form of machine-learned classifiers. We consider the problem of testing this common kind of machine-generated program when the only oracle is an end user: e.g., only you can determine whether your email is properly filed. We present test selection methods that provide high failure detection rates even for small test suites, and show that these methods work both in large-scale random experiments using a “gold standard” and in studies with real users. Our methods are inexpensive and largely algorithm-independent. Key to our methods is the exploitation of properties of classifiers that is not possible in traditional software testing. Our results suggest that it is plausible for time-pressured end users to interactively detect failures, even very hard-to-find failures, without wading through a large number of successful (and thus less useful) tests. We additionally show that some methods are able to find what are arguably the most difficult-to-detect faults of classifiers: cases where machine learning algorithms have high confidence in an incorrect result.
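The abstract does not spell out the selection methods themselves, so the sketch below is only an illustration of the general idea it describes: exploiting a classifier's own outputs, which traditional software under test does not provide, to decide which tests to show a human oracle first. It ranks unlabeled messages by the classifier's reported confidence, lowest first, so a time-pressured user judges the most suspect predictions early. The scikit-learn pipeline, the tiny corpus, and all identifiers here are assumptions for illustration, not artifacts of the paper.

```python
# Hedged sketch of confidence-based test selection for an end-user oracle.
# Assumes a scikit-learn-style classifier exposing predict_proba; the data
# and labels are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative training set: label 0 = "work", 1 = "personal".
train_docs = [
    "quarterly budget review meeting",
    "please send the project report",
    "dinner with the family tonight",
    "photos from our weekend hike",
]
train_labels = [0, 0, 1, 1]

# Unlabeled messages the end user could be asked to check.
test_docs = [
    "budget for the family dinner",      # deliberately ambiguous
    "project meeting moved to friday",
    "hiking photos attached",
]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

clf = MultinomialNB().fit(X_train, train_labels)
proba = clf.predict_proba(X_test)

# Confidence = probability of the predicted class; lower means the
# classifier is less sure, so surface those cases to the user first.
confidence = proba.max(axis=1)
for i in np.argsort(confidence):
    pred = clf.predict(X_test[i])[0]
    print(f"conf={confidence[i]:.2f}  pred={pred}  {test_docs[i]}")
```

Note the abstract's closing caveat: ranking by low confidence alone would never surface the high-confidence failures the paper also targets, so a sketch like this should be read as one strategy in a family of selection methods, not as the paper's method.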
