Automatic Detection of Social Tag Spams Using a Text Mining Approach

Social tags are annotations that users collaboratively add to Web pages. Such tags make it easier to understand the meaning of Web pages and to classify them, and they may also increase precision when retrieving Web pages. Today, social tags are mostly added manually by users of social bookmarking Web sites. This manual annotation process may produce diverse, redundant, and inconsistent tags. In addition, many tags are inconsistent with the Web pages they annotate, which reduces the usefulness of social tags. In this work we develop an automatic scheme that discovers the associations between Web pages and social tags and applies these associations to social tag spam detection. We apply a text mining approach based on self-organizing maps to find the relationships between Web pages and social tags. These relationships remedy the disadvantages of manual annotation, and the discovered associations are then used to identify social tag spam. Preliminary experiments show that the method improved the quality and usability of social tags.
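
The abstract does not give the details of the scheme, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that pages are represented as unit-length term-frequency vectors, that a small one-dimensional self-organizing map is trained on those vectors, and that a (page, tag) annotation is treated as suspicious when the page maps to a neuron far from the neuron where the tag is most often used. All function names, parameters, and the `max_gap` threshold are hypothetical.

```python
import math
import random


def vectorize(words, vocab):
    """Unit-length term-frequency vector of a page over a fixed vocabulary."""
    vec = [float(words.count(term)) for term in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def bmu(weights, vec):
    """Index of the best-matching unit (closest neuron, squared Euclidean)."""
    dists = [sum((w - v) ** 2 for w, v in zip(neuron, vec)) for neuron in weights]
    return dists.index(min(dists))


def train_som(vectors, n_neurons=4, epochs=50, seed=0):
    """Train a one-dimensional SOM and return its neuron weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        rate = 0.5 * frac                        # learning rate decays linearly
        radius = max(1.0, n_neurons / 2 * frac)  # neighbourhood shrinks over time
        for vec in vectors:
            winner = bmu(weights, vec)
            for j, neuron in enumerate(weights):
                # Neurons near the winner on the grid are pulled harder.
                influence = math.exp(-((j - winner) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    neuron[k] += rate * influence * (vec[k] - neuron[k])
    return weights


def suspicious_annotations(pages, annotations, weights, vocab, max_gap=1):
    """Flag (page, tag) pairs whose page lies far from the tag's usual neuron.

    Assumed heuristic: a tag's "usual" neuron is the one that most of its
    annotated pages map to; max_gap is a hypothetical grid-distance threshold.
    """
    page_unit = {pid: bmu(weights, vectorize(words, vocab))
                 for pid, words in pages.items()}
    tag_units = {}
    for pid, tag in annotations:
        tag_units.setdefault(tag, []).append(page_unit[pid])
    flagged = []
    for pid, tag in annotations:
        units = tag_units[tag]
        usual = max(sorted(set(units)), key=units.count)
        if abs(page_unit[pid] - usual) > max_gap:
            flagged.append((pid, tag))
    return flagged
```

In this sketch, `pages` maps a page identifier to its list of words and `annotations` is a list of (page identifier, tag) pairs. A realistic system would need a larger two-dimensional map, proper text preprocessing, and a threshold tuned on labeled data.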
