Learning to recognize webpage genres

Webpages are mainly distinguished by their topic (e.g., politics, sports) and genre (e.g., blogs, homepages, e-shops). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to address the user's information need. In this paper, we present an approach to webpage genre detection based on fully automated extraction of the feature set that represents the style of webpages. The features we propose (character n-grams of variable length and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of still-evolving web genres and the noisy environment of the web. Experiments on two publicly available corpora show that the proposed approach outperforms previously reported results. They also show that character n-grams are better features than words as the dimensionality increases, and that the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training on one genre palette and testing on a different one, and using the features extracted from one corpus to discriminate the genres of the other corpus) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.
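
To make the two feature types and the two representations concrete, the following is a minimal sketch, not the feature-extraction procedure of the paper. It assumes a fixed n-gram length range of 3 to 5, counts every n-gram without any feature selection, and uses only the Python standard library. All names in it (`TagAndTextCollector`, `char_ngrams`, `extract_features`, `to_binary`) and the sample page are hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collects opening-tag names and text nodes from an HTML string.

    Assumption: text inside <script> or <style> is kept like any other
    text; a real pipeline would probably filter it.
    """

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def extract_features(html, n_min=3, n_max=5):
    """Return term-frequency features: character n-grams plus HTML tags."""
    parser = TagAndTextCollector()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    text = " ".join(" ".join(parser.text_parts).split())
    features = Counter()
    for gram, count in char_ngrams(text, n_min, n_max).items():
        features["ng:" + gram] = count
    for tag, count in Counter(parser.tags).items():
        features["tag:" + tag] = count
    return features


def to_binary(features):
    """Binary representation: 1 if the feature occurs at all."""
    return {name: 1 for name in features}


# Hypothetical sample page, used only to exercise the functions.
page = "<html><body><h1>Shop</h1><p>Shop now</p><p>Shop more</p></body></html>"
tf = extract_features(page)
binary = to_binary(tf)
```

In this sketch the term-frequency vector records how often each n-gram or tag occurs, while the binary vector records only presence. Either dictionary could be handed to a standard text classifier after mapping feature names to vector indices.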
