Improving Relevance of Keyword Extraction from the Web Utilizing Visual Style Information

Information is growing faster than ever before, and advanced services that facilitate information “consumption” (e.g., recommendation, personalized navigation) are needed. Such services require at least lightweight semantics. The keyword paradigm is now widely used and appears to achieve satisfactory results in fields such as social bookmarking and ontology learning. In this paper we explore the impact of a web site’s visual style on the extraction of relevant keywords. We propose a method for extracting relevant keywords from web pages that combines traditional automatic term recognition algorithms with processing of the web site’s visual style, focusing in particular on Cascading Style Sheets (CSS). An evaluation conducted on 200 “wild” Web documents from 12 different web sites showed that our method increases the relevance of the extracted keywords.
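
The general idea of combining term scoring with visual style can be illustrated with a minimal sketch. It is not the method evaluated in the paper: plain term frequency stands in for an automatic term recognition algorithm, and the names `STYLE_WEIGHTS`, `style_weight`, `score_terms`, and `top_keywords`, as well as the weight values, are assumptions made only for illustration.

```python
import re
from collections import Counter

# Hypothetical multipliers for CSS declarations; the values are
# illustrative assumptions, not taken from the paper.
STYLE_WEIGHTS = {
    "font-weight:bold": 2.0,
    "font-size:large": 1.5,
}


def style_weight(declarations):
    """Multiply the weights of all known declarations (unknown ones count as 1.0)."""
    weight = 1.0
    for declaration in declarations:
        weight *= STYLE_WEIGHTS.get(declaration, 1.0)
    return weight


def score_terms(fragments):
    """Score single-word terms by style-weighted frequency.

    fragments: list of (text, [css declarations]) pairs, one per text
    fragment of a page. Term frequency is a simple stand-in for a real
    automatic term recognition algorithm.
    """
    scores = Counter()
    for text, declarations in fragments:
        weight = style_weight(declarations)
        for token in re.findall(r"[a-z]+", text.lower()):
            scores[token] += weight
    return scores


def top_keywords(fragments, k=3):
    """Return the k highest-scoring terms."""
    return [term for term, _ in score_terms(fragments).most_common(k)]


if __name__ == "__main__":
    page = [
        ("Keyword extraction", ["font-weight:bold"]),
        ("extraction of terms from pages and terms", []),
    ]
    print(top_keywords(page, k=1))
```

In this sketch a term that appears in visually emphasized text contributes more to its score than the same term in body text, so emphasized terms tend to rank higher among the extracted keywords.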
