TERM WEIGHTING BASED ON INDEX OF GENRE FOR WEB PAGE GENRE CLASSIFICATION

Automated identification of web page genre has become an important area of web page classification, as it can improve the quality of web search results and reduce search time. Terms used in classification are generally weighted with document-based TF-IDF. However, this method does not take genre into account, even though web page documents are categorized by genre. When genre labels exist, a term that appears often within one genre should be more significant in document indexing than a term that appears frequently across many genres, even if the latter has a high TF-IDF value. We propose a new weighting method for indexing web page documents called inverse genre frequency (IGF). This method is based on genre, a manual semantic categorization taken from previous research. Experimental results show that term weighting based on index of genre (TF-IGF) performed better than term weighting based on index of document (TF-IDF). The highest accuracy, precision, recall, and F-measure were 78%, 80.2%, 78%, and 77.4%, respectively, when genre-specific keywords were excluded, and 78.9%, 78.7%, 78.9%, and 78.1%, respectively, when they were included.
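The contrast between the two weightings can be illustrated with a minimal sketch. The abstract does not state the IGF formula, so the code below assumes it mirrors IDF with genres in place of documents, that is, IGF(t) = log(G / gf(t)), where G is the number of genres and gf(t) is the number of genres containing term t. The function names, the toy corpus, and the genre labels are hypothetical and serve only as an illustration, not as the method evaluated in the experiments.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by raw term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def tf_igf(docs, genres):
    """Weight each term by raw term frequency times log(G / genre frequency).

    Assumption: IGF is defined by analogy with IDF; the source abstract
    does not give the exact formula.
    """
    n_genres = len(set(genres))
    genres_per_term = {}
    for doc, genre in zip(docs, genres):
        for term in set(doc):
            genres_per_term.setdefault(term, set()).add(genre)
    return [
        {term: count * math.log(n_genres / len(genres_per_term[term]))
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Hypothetical toy corpus: tokenized pages and their genre labels.
docs = [
    ["price", "cart", "buy", "page"],
    ["price", "cart", "page"],
    ["faq", "question", "page"],
    ["blog", "post", "page"],
]
genres = ["eshop", "eshop", "faq", "blog"]

idf_weights = tf_idf(docs)
igf_weights = tf_igf(docs, genres)

# "cart" occurs in two documents but only one genre; "faq" occurs in one
# document and one genre; "page" occurs everywhere.
print("TF-IDF cart:", idf_weights[0]["cart"], "faq:", idf_weights[2]["faq"])
print("TF-IGF cart:", igf_weights[0]["cart"], "faq:", igf_weights[2]["faq"])
print("page:", idf_weights[0]["page"], igf_weights[0]["page"])
```

Under this assumed definition, "cart" is penalized by TF-IDF for recurring in several documents, whereas TF-IGF gives it the same weight as a term seen in a single document, because both are confined to one genre. A term such as "page", which appears in every genre, receives zero weight under both schemes.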
