Thread detection in dynamic text message streams

Text message streams are a newly emerging type of Web data, produced in enormous quantities as Instant Messaging and Internet Relay Chat have grown popular. Detecting the threads contained in such a stream benefits a range of applications, including information retrieval, expert recognition and even crime prevention. Despite its importance, the problem has received little research attention so far, largely because the messages are usually very short and incomplete. In this paper, we present a rigorous definition of the thread detection task and a preliminary solution to it. We propose three variations of a single-pass clustering algorithm that exploit the temporal information in the streams. We also put forward an algorithm based on linguistic features that exploits discourse structure information. We conducted several experiments on a real dataset to compare our approaches with existing algorithms. The results show that all three variations outperform the basic single-pass algorithm. In terms of F1, our algorithm based on linguistic features achieves relative improvements of 69.5% over the basic single-pass algorithm and 9.7% over the best of the three variations.
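To make the idea of a time-aware single-pass clustering concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the abstract does not specify the three variations, so the exponential time decay, the bag-of-words cosine similarity, and the names `single_pass_threads`, `threshold` and `half_life` are all assumptions chosen for illustration. Each message is compared once against the existing threads and either joins the best match or starts a new thread.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_threads(messages, threshold=0.3, half_life=300.0):
    """Assign a thread id to each (timestamp_seconds, text) message in order.

    Hypothetical scoring: cosine similarity to the thread's accumulated
    term counts, multiplied by a decay that halves every `half_life`
    seconds since the thread's last message (an illustrative assumption).
    """
    threads = []  # each thread: {"centroid": Counter, "last_time": float}
    labels = []
    for timestamp, text in messages:
        vector = Counter(text.lower().split())
        best_id, best_score = None, 0.0
        for thread_id, thread in enumerate(threads):
            gap = max(0.0, timestamp - thread["last_time"])
            decay = 0.5 ** (gap / half_life)
            score = cosine(vector, thread["centroid"]) * decay
            if score > best_score:
                best_id, best_score = thread_id, score
        if best_id is None or best_score < threshold:
            # No thread is close enough: start a new one.
            threads.append({"centroid": Counter(vector), "last_time": timestamp})
            best_id = len(threads) - 1
        else:
            # Merge the message into the best-matching thread.
            threads[best_id]["centroid"].update(vector)
            threads[best_id]["last_time"] = timestamp
        labels.append(best_id)
    return labels


stream = [
    (0, "anyone know how to install python"),
    (5, "lunch at noon today"),
    (10, "install python with the installer"),
    (20, "noon lunch works for me"),
]
print(single_pass_threads(stream))
```

With this toy stream, the two interleaved conversations are separated because each later message shares terms with exactly one earlier thread. Setting `half_life` to a very large value removes the temporal effect and approximates a purely content-based single pass.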
