Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing

Large-scale classification is an increasingly critical Big Data problem. So far, however, very little has been published on how this is done in practice. In this paper we describe Chimera, our solution for classifying tens of millions of products into more than 5,000 product types at WalmartLabs. We show that at this scale, many conventional assumptions regarding learning and crowdsourcing break down, and that existing solutions cease to work. We describe how Chimera employs a combination of learning, rules (created by in-house analysts), and crowdsourcing to achieve accurate, continuously improving, and cost-effective classification. We discuss a set of lessons learned for other similar Big Data systems. In particular, we argue that at large scales crowdsourcing is critical, but must be used in combination with learning, rules, and in-house analysts. We also argue that using rules (in conjunction with learning) is a must, and that more research attention should be paid to helping analysts create and manage tens of thousands of rules more effectively.
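To make the combination of rules, learning, and human review concrete, the following is a minimal sketch and not Chimera's actual design. It assumes analyst-written regex rules are checked first, a learned classifier handles the rest, and low-confidence items are set aside for crowd or analyst review. The names `Rule`, `Decision`, `classify`, `toy_learner`, the example rule, and the 0.8 threshold are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rule:
    """Hypothetical analyst-written rule: a regex over the product title."""
    pattern: str
    product_type: str

    def matches(self, title: str) -> bool:
        return re.search(self.pattern, title, flags=re.IGNORECASE) is not None


@dataclass
class Decision:
    product_type: Optional[str]
    source: str  # "rule", "learner", or "needs_review"


def classify(
    title: str,
    rules: List[Rule],
    learner: Callable[[str], Tuple[str, float]],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> Decision:
    # 1. Rules take precedence in this sketch (an assumption).
    for rule in rules:
        if rule.matches(title):
            return Decision(rule.product_type, "rule")
    # 2. Otherwise ask the learned model for a label and a confidence.
    label, confidence = learner(title)
    if confidence >= threshold:
        return Decision(label, "learner")
    # 3. Low-confidence items are left unlabeled for human review.
    return Decision(None, "needs_review")


def toy_learner(title: str) -> Tuple[str, float]:
    """Stand-in for a trained classifier; returns (label, confidence)."""
    if "ring" in title.lower():
        return ("rings", 0.9)
    return ("unknown", 0.3)


RULES = [Rule(r"\bwedding bands?\b", "rings")]

if __name__ == "__main__":
    for t in ["Gold wedding band size 7", "Diamond ring", "Motor oil 5W-30"]:
        print(t, "->", classify(t, RULES, toy_learner))
```

In a sketch like this, the items tagged `needs_review` are where crowd workers or in-house analysts would step in, and their feedback could in turn become new rules or training data.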
