An Empirical Comparison of Four Text Mining Methods

The volume of textual data available to researchers and businesses is growing rapidly, which has led information systems (IS) researchers to investigate a range of text mining techniques. This paper examines four frequently used text mining methods in order to identify their advantages and limitations: (1) latent semantic analysis, (2) probabilistic latent semantic analysis, (3) latent Dirichlet allocation, and (4) the correlated topic model. We compare the four methods and highlight the conditions under which each is best applied. The paper sheds light on the theory that underlies text mining methods and provides guidance for researchers who seek to apply them.
