AUTOMATIC GENRE CLASSIFICATION OF HOME PAGES ON THE WEB

Abstract The World Wide Web contains a vast number of pages spanning a wide variety of genres. A method for automatically determining the genre of a web page could greatly improve the results returned by web search engines. Knowledge of page genres could allow a user to restrict a search to particular genres. A page's genre could also be used for ranking in cases where one genre is known to be more valuable to a particular query than another. One of the largest genres of web pages is the home page genre, which can be thought of as a hierarchy that includes the sub-genres personal home page, corporate home page, and organization home page. A web page's genre can be identified by extracting features from the page and then using those features to determine the genre. In this research, feature sets are used to train and test a neural network for genre classification. The neural network classifies web pages as personal, corporate, or organization home pages.
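The pipeline described in the abstract (extract features from a page, then train a network on them) can be sketched roughly as follows. This is a minimal illustration, not the method used in this research: the three features (first-person pronoun rate, link rate, image rate), the names `PageStats`, `extract_features`, `train`, and `classify`, and the hyperparameters are all hypothetical, and a single-layer softmax network stands in for whatever architecture the author actually used.

```python
import math
from html.parser import HTMLParser

# The three sub-genres named in the abstract.
GENRES = ["personal", "corporate", "organization"]


class PageStats(HTMLParser):
    """Collects simple counts from an HTML page (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1
        elif tag == "img":
            self.images += 1

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def extract_features(html):
    """Map a page to [first-person rate, link rate, image rate, bias]."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    n = max(len(stats.words), 1)
    first_person = sum(w.strip(".,!?") in {"i", "my", "me"} for w in stats.words)
    return [first_person / n, stats.links / n, stats.images / n, 1.0]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, features):
    """One score per genre, turned into probabilities."""
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])


def train(examples, epochs=200, lr=0.5):
    """Fit a single-layer softmax network with per-example gradient descent.

    `examples` is a list of (feature_vector, genre_label) pairs.
    """
    n_features = len(examples[0][0])
    weights = [[0.0] * n_features for _ in GENRES]
    for _ in range(epochs):
        for features, label in examples:
            probs = predict_proba(weights, features)
            for k, row in enumerate(weights):
                error = probs[k] - (1.0 if GENRES[k] == label else 0.0)
                for j in range(n_features):
                    row[j] -= lr * error * features[j]
    return weights


def classify(weights, html):
    """Return the most probable genre label for a page."""
    probs = predict_proba(weights, extract_features(html))
    return GENRES[probs.index(max(probs))]
```

In use, one would build `examples` by calling `extract_features` on labeled home pages, pass them to `train`, and call `classify` on unseen pages. A realistic feature set would be much richer than these three counts.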
