With advances in internet technology and changes in how people communicate, much first-hand news is now discussed in Internet forums well before it is reported in traditional mass media. These forums also provide an effective channel for illegal activities such as the dissemination of copyrighted movies, threatening messages, and online gambling. Law enforcement agencies are looking for solutions to monitor these discussion forums for possible criminal activities and to download suspected postings as evidence for investigation. The volume of postings is huge: for 10 popular forums in Hong Kong, we found that there are 300,000 new messages every day. In this paper, we propose an automatic system that tackles this problem. The proposed system continuously downloads postings from selected discussion forums and employs data mining techniques to identify hot topics and to cluster authors into groups using word-based user profiles. Different techniques are applied to process the collected data, and several approaches to the problem are proposed.