Mining micro-blogging users’ interest features via fingerprint generation

Micro-blogging is now widely used as a social network service for communication and information sharing, so mining the behavior features of micro-blogging users is important in both the economic and social fields. This paper proposes a framework for analyzing users’ interest features. After data cleaning, word segmentation, POS (part-of-speech) filtering and synonym merging, the keywords, called terms, of all the tweets posted by a typical user in 2011 are extracted. A VSM (vector space model) is then used to generate the feature vector of the tweets from these terms. Next, a k-bit binary string called a fingerprint is generated from the high-dimensional feature vector of the tweets using the Simhash algorithm. The micro-blogging user’s interest features and change patterns can be detected by analyzing the fingerprint sequence and the distance between adjacent fingerprints. Using Sina micro-blogging as the background, a series of experiments is conducted to demonstrate the effectiveness of the algorithms.

Keywords-micro-blogging; interest feature; tweet; fingerprint
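
The fingerprint step described in the abstract can be illustrated with a minimal sketch of the generic Simhash procedure. It is not the authors' implementation. Several choices here are assumptions made for illustration only: term weights are raw term frequencies, each term is hashed with MD5 truncated to k bits (so k must not exceed 128), k defaults to 64, and the distance between fingerprints is taken to be the Hamming distance. The function names `term_hash`, `simhash`, `hamming_distance` and `fingerprint_sequence` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import List, Mapping, Sequence


def term_hash(term: str, k: int = 64) -> int:
    """Map a term to a k-bit integer (MD5 is an assumed choice of hash; k <= 128)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << k) - 1)


def simhash(weights: Mapping[str, float], k: int = 64) -> int:
    """Collapse a weighted term vector into a k-bit fingerprint."""
    totals = [0.0] * k
    for term, weight in weights.items():
        h = term_hash(term, k)
        for bit in range(k):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(k):
        # A strictly positive total yields a 1 bit in the fingerprint.
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint_sequence(periods: Sequence[Sequence[str]], k: int = 64) -> List[int]:
    """Build one fingerprint per period, weighting terms by frequency (assumed)."""
    return [simhash(Counter(terms), k) for terms in periods]


if __name__ == "__main__":
    # Hypothetical term lists for three consecutive periods of one user.
    periods = [["music", "film", "music"], ["music", "film"], ["football", "match"]]
    fps = fingerprint_sequence(periods)
    for earlier, later in zip(fps, fps[1:]):
        print(hamming_distance(earlier, later))
```

Under these assumptions, a small distance between adjacent fingerprints suggests that the user's terms changed little from one period to the next, while a large distance may indicate a shift in interest.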