Automatic Identification of Home Pages on the Web

The research reported in this paper is the first phase of a larger project on the automatic classification of Web pages by genre. The long-term goal is to incorporate Web page genre into the search process to improve the quality of search results. In this phase, a neural net classifier was trained to distinguish home pages from non-home pages and to classify each home page as a personal, corporate, or organization home page. Results indicate that the classifier can distinguish home pages from non-home pages and, within the home page genre, can distinguish personal from corporate home pages. Organization home pages, however, were more difficult to distinguish from personal and corporate home pages.
