General purpose computer-assisted clustering and conceptualization

We develop a computer-assisted method for discovering insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimizes a different objective function. Almost all are well defined, but determining in advance which one, if any, will partition a given set of objects in an “insightful” or “useful” way for a given user is an unsolved and difficult problem, if not a logically impossible one. We develop a metric space of partitions from all existing cluster analysis methods applied to a given dataset (along with millions of other solutions we add based on combinations of existing clusterings), and we enable a user to explore and interact with it and thereby quickly reveal or prompt useful or insightful conceptualizations. In addition, although it is uncommon to do so in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal for specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than expert human coders or many existing fully automated methods.
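To make the idea of a metric space of partitions concrete, the following is a minimal sketch of how pairwise distances between clusterings of the same objects might be computed. The abstract does not state which distance is used, so the choice of variation of information (a standard information-based metric on partitions) is an assumption, and the function names `variation_of_information` and `distance_matrix` are hypothetical.

```python
import math
from collections import Counter


def variation_of_information(a, b):
    """Variation of information between two partitions of the same objects.

    `a` and `b` are equal-length sequences of cluster labels. The result is
    H(a | b) + H(b | a) in nats; it is zero exactly when the two partitions
    group the objects identically, regardless of label names.
    NOTE: using this metric is an assumption made for illustration.
    """
    if len(a) != len(b):
        raise ValueError("partitions must label the same objects")
    n = len(a)
    size_a = Counter(a)           # cluster sizes in partition a
    size_b = Counter(b)           # cluster sizes in partition b
    joint = Counter(zip(a, b))    # sizes of the pairwise intersections
    vi = 0.0
    for (i, j), n_ij in joint.items():
        p_ij = n_ij / n
        # -p_ij * [log p(i, j)/p(i) + log p(i, j)/p(j)]
        vi -= p_ij * (math.log(n_ij / size_a[i]) + math.log(n_ij / size_b[j]))
    return vi


def distance_matrix(partitions):
    """Symmetric matrix of pairwise distances among a list of partitions."""
    k = len(partitions)
    d = [[0.0] * k for _ in range(k)]
    for r in range(k):
        for c in range(r + 1, k):
            d[r][c] = d[c][r] = variation_of_information(
                partitions[r], partitions[c]
            )
    return d


# Three toy clusterings of four objects, e.g. from three different methods.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first, with labels swapped
    [0, 1, 0, 1],  # a grouping that cuts across the first
]
D = distance_matrix(partitions)
```

A matrix like `D` could then be passed to a low-dimensional embedding method so that a user can browse nearby clusterings visually; that step is omitted here because the abstract does not specify it.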
