Multi-level K-means text clustering technique for topic identification for competitor intelligence

The proliferation of the web as an easily accessible information resource has led many corporations to gather competitor intelligence from the internet. While collecting such information from the internet is easy, collating and structuring it for perusal by business decision makers is difficult. Text clustering based topic identification techniques are expected to be very useful for such applications. Using appropriate clustering techniques, a competitor intelligence corpus gathered from the web can be divided into topical groups, which makes analysis of this information comparatively easier for managers. This paper presents a study on the effectiveness of the standard K-means text clustering algorithm applied at multiple levels, in a top-down, divide-and-conquer fashion, to a competitor intelligence corpus created from publicly available sources on the web, such as news, blogs, and research papers. The paper also demonstrates the capability of the Multi-level K-means (ML-KM) clustering technique to determine the optimal number of clusters as part of the clustering process. The cluster validity metric used to determine cluster quality is explained, along with the other user-controlled configuration parameters. It is found empirically that the ML-KM technique also addresses one problem of stand-alone standard K-means (S-KM), namely its bias towards convex, spherical clusters, which results in bigger clusters subsuming smaller ones. This ability of ML-KM to detect smaller clusters makes it more suitable than stand-alone S-KM for clustering competitor intelligence text corpora, where small, niche clusters can lead to important findings. Experimental results are presented for both ML-KM and stand-alone S-KM on the competitor intelligence corpus as well as on the standard Reuters corpus.
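The top-down, divide-and-conquer idea can be illustrated with a minimal sketch. It is not the paper's implementation, and several choices in it are assumptions: each cluster is bisected with plain K-means (k = 2), the validity metric is the mean silhouette width (Rousseeuw, 1987), the parameters `min_size` and `min_quality` are hypothetical stand-ins for the user-controlled configuration parameters, seeding is a deterministic farthest-point rule, and 2-D points stand in for document vectors.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means with deterministic farthest-point seeding
    (an assumption of this sketch). Returns one label per point."""
    mean = tuple(sum(col) / len(points) for col in zip(*points))
    centers = [max(points, key=lambda p: dist(p, mean))]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; returns -1.0 if fewer than two clusters exist."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def ml_kmeans(points, min_size=4, min_quality=0.6):
    """Recursively bisect a cluster; keep a split only if its silhouette
    reaches min_quality. Returns the list of leaf clusters."""
    if len(points) < max(min_size, 2):
        return [points]
    labels = kmeans(points, 2)
    if silhouette(points, labels) < min_quality:
        return [points]
    leaves = []
    for c in (0, 1):
        part = [p for p, l in zip(points, labels) if l == c]
        leaves.extend(ml_kmeans(part, min_size, min_quality))
    return leaves

# Toy data: one larger group and two small groups that lie close together.
big = [(x * 0.5, y * 0.5) for x in range(3) for y in range(3)]
small_1 = [(10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0)]
small_2 = [(10.0, 5.0), (10.0, 6.0), (11.0, 5.0), (11.0, 6.0)]
points = big + small_1 + small_2

leaves = ml_kmeans(points)
print(sorted(len(c) for c in leaves))  # [4, 4, 9]
```

In this toy run the first level separates the larger group from the two small ones, and the second level separates the two small groups from each other. Further splits are rejected because their silhouette falls below `min_quality`, so the number of clusters emerges from the recursion rather than being supplied in advance.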
