Scalable k-NN based text clustering

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns and identification of common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present an approach to text clustering that builds an approximate k-NN graph and then computes its connected components, each of which represents a cluster. We focus on the scalability/accuracy tradeoff that underlies our method: an extensive experimental campaign on real-life datasets shows that even rough approximations of the k-NN graph are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
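
The two-stage pipeline described above (build a k-NN graph over the texts, then take connected components as clusters) can be illustrated with a small sketch. This is not the authors' implementation: it uses an exact brute-force k-NN search instead of an approximate one, and the Jaccard similarity over whitespace tokens, the `min_sim` threshold, and all function names are assumptions chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_graph(texts, k=2, min_sim=0.3):
    """Exact brute-force k-NN graph, a stand-in for an approximate one.

    For each text, keep edges to its k most similar neighbours, dropping
    any neighbour whose similarity falls below the hypothetical min_sim.
    """
    tokens = [set(t.lower().split()) for t in texts]
    edges = set()
    for i, ti in enumerate(tokens):
        scored = [(jaccard(ti, tj), j) for j, tj in enumerate(tokens) if j != i]
        scored.sort(key=lambda p: (-p[0], p[1]))  # most similar first
        for sim, j in scored[:k]:
            if sim >= min_sim:
                edges.add((min(i, j), max(i, j)))  # undirected edge
    return edges


def connected_components(n, edges):
    """Connected components of an undirected graph via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def cluster_texts(texts, k=2, min_sim=0.3):
    """Cluster texts: k-NN graph first, then connected components."""
    return connected_components(len(texts), knn_graph(texts, k, min_sim))


if __name__ == "__main__":
    docs = [
        "cheap pills buy now",
        "buy cheap pills now online",
        "football match tonight",
        "football match results tonight",
        "weather forecast",
    ]
    print(cluster_texts(docs))
```

The brute-force search compares every pair of texts, so its cost grows quadratically with the number of items; that cost is the reason the abstract stresses approximate graph construction. Lowering `k` or raising `min_sim` yields a sparser graph and therefore smaller, tighter components.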
