Challenge AI Mind: A Crowd System for Proactive AI Testing

Artificial Intelligence (AI) has made its way into many aspects of our lives; however, without appropriate testing, deployed AI systems are often criticized for failing in critical and embarrassing cases. Existing testing approaches depend mainly on fixed, pre-defined datasets, which provide limited test coverage. In this paper, we propose the concept of proactive testing, which dynamically generates test data and evaluates the performance of AI systems. We further introduce Challenge.AI, a new crowd system that integrates crowdsourcing and machine learning techniques in the processes of error generation, error validation, error categorization, and error analysis. We present experiences and insights from a participatory design process with AI developers. The evaluation shows that the crowd workflow is more effective with the help of machine learning techniques. AI developers found that our system helped them discover unknown errors made by their AI models and engage in the process of proactive testing.
