Comparative study of text clustering techniques in virtual worlds

The Virt-UAM (Virtual Worlds at Universidad Autónoma de Madrid) platform supports the design and implementation of virtual spaces in which a set of avatars can be intensively monitored using tools managed by an administrator. In a virtual world, users can move and interact with one another with a high degree of freedom. Their movements, interactions, and any other information related to the avatars' conversations can be stored, so this data is available for processing and analysis to obtain user behavioural patterns. Document clustering techniques have been widely applied to automatically organize a document corpus into clusters of similar documents. Topic detection can be considered a special case of document clustering; these techniques can therefore be applied to textual chat to detect clusters in the data and then extract the conversation topics. The Mahout(TM) machine learning library is an Apache(TM) project whose main goal is to build scalable machine learning libraries. It provides a set of ready-to-use algorithms for data mining and information retrieval. This paper presents a practical application of some of the clustering algorithms available in Mahout in a virtual-world scenario. The algorithms are applied to extract topics from the clusters obtained from the text messages. Finally, a comparative study of the document clustering algorithms used is presented.
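The idea of detecting topics by clustering chat messages can be illustrated with a minimal sketch. It is not the Mahout-based pipeline used in the paper: it is a small pure-Python k-means over TF-IDF vectors with cosine similarity, chosen only as a familiar example, since the abstract does not name the algorithms compared. All function names, the stopword list, the sample messages, and the seed choice are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(messages):
    """Turn each message into a unit-length TF-IDF vector."""
    docs = [Counter(w for w in m.lower().split() if w not in STOPWORDS)
            for m in messages]
    n = len(docs)
    df = Counter(t for d in docs for t in d)  # document frequency per term
    return [
        normalize({t: tf * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, tf in d.items()})
        for d in docs
    ]


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, seed_indices, max_iter=20):
    """Cluster vectors, starting from the documents at seed_indices."""
    centroids = [dict(vectors[i]) for i in seed_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each message goes to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for c in range(len(centroids)):
            total = defaultdict(float)
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] += w
            if total:  # keep the old centroid if a cluster ends up empty
                centroids[c] = normalize(total)
    return labels, centroids


def top_terms(centroid, k=2):
    """Return the k highest-weighted terms of a centroid as a topic label."""
    ranked = sorted(centroid.items(), key=lambda item: (-item[1], item[0]))
    return [t for t, _ in ranked[:k]]


# Hypothetical chat messages standing in for avatar conversations.
messages = [
    "the dragon attacked the castle",
    "dragon fire burned the castle walls",
    "the exam covers calculus",
    "calculus homework before the exam",
]
labels, centroids = kmeans(tfidf_vectors(messages), seed_indices=[0, 2])
topics = [top_terms(c) for c in centroids]
```

Here `labels` gives the cluster index of each message and `topics` lists the dominant terms of each cluster, which serve as a rough topic description. Fixed seed indices keep the sketch deterministic; a real comparison would also vary the number of clusters and the distance measure.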
