Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification

This research concerns the development of web content detection systems that can automatically classify any web page into pre-defined content categories. Our work is motivated by practical experience and the observation that certain categories of web pages, such as those containing hatred and violence, remain much harder to classify accurately even when both content and structural features are taken into account. To further improve detection performance, we bring web sentiment features into the classification models. In addition, we incorporate n-gram representations into our classification approach, on the assumption that n-grams capture more local context information in text and could thus enhance topic similarity analysis. Unlike most studies, which consider only the presence or frequency count of n-grams, we use tf-idf weighted n-grams to build the content classification models. Our results show that unigram-based models, although a much simpler approach, retain their unique value and effectiveness in web content classification. Higher-order n-gram based approaches, especially 5-gram based models that combine topic similarity features with sentiment features, bring significant improvement in precision for the Violence category and the two Racism-related web categories.
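To make the idea of tf-idf weighted n-grams concrete, the following is a minimal sketch, not the implementation used in this work. It assumes whitespace tokenization, word-level n-grams, a smoothed idf of the form ln((1 + N) / (1 + df)) + 1, and cosine similarity as the topic similarity measure. The function names (`word_ngrams`, `tfidf_vectors`, `cosine`, `combine_features`) and the shape of the sentiment scores are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word-level n-grams of a text (assumes whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Build one sparse tf-idf vector (a dict) per document over its n-grams."""
    counts_per_doc = [Counter(word_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents does each n-gram occur?
    df = Counter(gram for counts in counts_per_doc for gram in counts)
    num_docs = len(docs)
    vectors = []
    for counts in counts_per_doc:
        total = sum(counts.values()) or 1  # guard against documents shorter than n
        vectors.append({
            # Assumed weighting: relative term frequency times smoothed idf.
            gram: (c / total) * (math.log((1 + num_docs) / (1 + df[gram])) + 1)
            for gram, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(gram, 0.0) for gram, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combine_features(topic_similarity, sentiment_scores):
    """Concatenate a topic similarity score with sentiment scores.

    The sentiment scores are supplied by the caller; how they are computed
    is outside this sketch (hypothetical input).
    """
    return [topic_similarity] + list(sentiment_scores)


if __name__ == "__main__":
    docs = [
        "the march was peaceful and calm",
        "the march was peaceful and loud",
        "cats sleep on warm sofas all day",
    ]
    vecs = tfidf_vectors(docs, n=2)
    sim = cosine(vecs[0], vecs[1])
    # Hypothetical positive and negative sentiment strengths for the page.
    print(combine_features(sim, [0.2, 0.7]))
```

In this sketch the combined feature list would be passed to any standard classifier; the choice of classifier and of the reference vector used for similarity (for example, a category-level vector) is left open here.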
