Crowd synthesis: extracting categories and clusters from complex data

Analysts synthesize complex, qualitative data to uncover themes and concepts, but the process is time-consuming and cognitively taxing, and automated techniques show mixed success. Crowdsourcing could help by harnessing flexible and powerful human cognition on demand, but it introduces other challenges, including workers' limited attention and expertise. Further, text data can be complex, high-dimensional, and ill-structured. We address two major challenges left unsolved in prior crowd clustering work: scaffolding expertise for novice crowd workers, and producing consistent and accurate categories when each worker sees only a small portion of the data. To address these challenges, we present an empirical study of a two-stage approach that enables crowds to create an accurate and useful overview of a dataset: A) we draw on cognitive theory to assess how re-representing data can shorten items and focus them on salient dimensions; and B) we introduce an iterative clustering approach that gives workers a global overview of the data. We demonstrate that a classification-plus-context approach elicits the most accurate categories at the most useful level of abstraction.
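
The abstract describes the iterative clustering stage only at a high level; the sketch below is a minimal illustration of one plausible reading, in which every crowd task shows the worker the full set of categories created so far (the global overview) before asking them to label a small batch of items. The names `iterative_crowd_clustering`, `post_task`, `batch_size`, and `fake_workers` are hypothetical stand-ins, not the authors' implementation.

```python
from collections import defaultdict


def iterative_crowd_clustering(items, post_task, batch_size=8):
    """Cluster `items` by routing small batches of them to crowd workers.

    Each task shows a worker every category created so far (the global
    overview) plus one batch of unlabeled items; the worker assigns items
    to existing categories or proposes new ones. `post_task` stands in for
    whatever crowd platform is used and must return {item: category_label}.
    """
    categories = defaultdict(list)   # category label -> list of items
    unlabeled = list(items)

    while unlabeled:
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        assignments = post_task(batch, existing=sorted(categories))
        for item, label in assignments.items():
            categories[label].append(item)

    return dict(categories)


if __name__ == "__main__":
    # Stub "crowd" that buckets items by their first letter, just to
    # exercise the loop without a real crowdsourcing platform.
    def fake_workers(batch, existing):
        return {item: item[0].upper() for item in batch}

    notes = ["apple pie recipe", "air fryer tips",
             "banana bread", "baking soda uses"]
    print(iterative_crowd_clustering(notes, post_task=fake_workers, batch_size=2))
```

Passing the running category list into each task is what gives workers who see only a few items a view of the dataset's emerging structure, which the abstract argues is needed to keep categories consistent and accurate across workers.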
