URL classification using non-negative matrix factorization

Internet availability on a campus is not metered, so link bandwidth is vulnerable to misuse. Moreover, websites blacklist campuses for misuse. Blacklisting by academic websites such as IEEE and ACM is especially damaging, because it can deny serious researchers access to information. The objective of this paper is to proactively classify anomalous accesses, which will enable campus ISPs to deny access to users who misuse the Internet. In particular, URLs are classified using the short snippets (metadata) that are available. New features, namely random walk term weights and within-class popularity, used in tandem with non-negative matrix factorization, show considerable promise for classifying URLs. The classification accuracy is as high as 92.96% on 10 gigabytes of proxy data.
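The abstract names non-negative matrix factorization but does not spell out the factorization step. As a rough illustration only, the sketch below factors a small term-by-snippet count matrix V into non-negative factors W and H using the standard multiplicative update rules for the squared-error objective. The toy matrix, the rank, the iteration count, and the function names `nmf` and `frobenius_error` are all assumptions made for this example. Raw counts stand in for the paper's random walk term weights and within-class popularity features, which are not reproduced here, and the sketch is not the authors' implementation.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH (the objective being reduced)."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m).

    Uses multiplicative updates; eps only guards against division by zero.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the updates never get stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Hypothetical toy data: rows are terms, columns are URL snippets.
    v = [[3, 2, 0, 0],
         [2, 3, 0, 0],
         [0, 0, 4, 1],
         [0, 0, 1, 4]]
    w, h = nmf(v, rank=2)
    print("reconstruction error:", frobenius_error(v, w, h))
```

In a sketch like this, each column of H gives a low-dimensional, non-negative representation of one snippet, which could then be passed to any classifier. How the paper combines the factors with its term-weighting features is not stated in the abstract.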
