Automatic identification of genre in Web pages

Genre is a complex but intuitively understood concept. Home pages, FAQs and blogs are examples of genres currently thriving on the web. Automatically identifying web genres would help users find documents that are more relevant to their information needs. The aim of the research described in this book is to develop automatic genre classification algorithms. Several challenges, however, affect the modelling of these algorithms. First, genres on the web are instantiated in web pages, which can be considered documents of a new type, much more unpredictable and individualised than documents on paper. Second, the web is unstable and fluid and evolves at a fast pace, so genre identification is influenced by phenomena such as the formation of novel genres, genre hybridism, individualisation, and intra-genre and inter-genre variation. Finally, the automatically extractable genre-revealing features used up to now are not adequate to define existing and novel web genres. The author argues that automatic identification of genre in web pages needs more flexible genre classification schemes. The main body of the book describes experiments that support this claim.