Paper information - Scalable k-NN based text clustering

Scalable k-NN based text clustering

Clustering items using textual features is an important problem with many applications, such as root-cause analysis of spam campaigns and identification of common topics in social media. Due to the sheer size of such data, algorithmic scalability becomes a major concern. In this work, we present our approach to text clustering: it builds an approximate k-NN graph, then computes the connected components of that graph, each of which represents a cluster. Our focus is to understand the scalability/accuracy tradeoff that underlies our method. We do so through an extensive experimental campaign on real-life datasets, and show that even rough approximations of k-NN graphs are sufficient to identify valid clusters. Our method is scalable and can be easily tuned to meet requirements stemming from different application domains.
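The two-stage idea in the abstract (build a k-NN graph over the texts, then read clusters off its connected components) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses an exact brute-force k-NN search in place of the approximate graph construction the paper studies, and the Jaccard similarity on whitespace tokens, the `min_sim` threshold, and all function names are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_graph(docs, k=2, min_sim=0.3):
    """Return undirected edges linking each document to its k most similar
    neighbours. Exact brute force, O(n^2): a stand-in for an approximate
    k-NN graph. min_sim is a hypothetical cut-off that drops weak edges."""
    tokens = [set(d.lower().split()) for d in docs]
    edges = set()
    for i, ti in enumerate(tokens):
        scored = [(jaccard(ti, tj), j) for j, tj in enumerate(tokens) if j != i]
        # Highest similarity first; ties broken by document index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        for sim, j in scored[:k]:
            if sim >= min_sim:
                edges.add((min(i, j), max(i, j)))
    return edges


def connected_components(n, edges):
    """Group node indices 0..n-1 into connected components via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def cluster_texts(docs, k=2, min_sim=0.3):
    """Cluster documents: each connected component of the k-NN graph is a cluster."""
    return connected_components(len(docs), knn_graph(docs, k, min_sim))


docs = [
    "cheap pills buy now",
    "buy cheap pills now online",
    "cheap pills online buy",
    "football match tonight live",
    "live football match score tonight",
]
clusters = cluster_texts(docs)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

In this toy input the three spam-like messages form one component and the two football messages another. At realistic scale the quadratic neighbour search would be replaced by an approximate method, which is where the scalability/accuracy tradeoff discussed in the abstract arises.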
