Enhancing automatic term recognition algorithms with HTML tags processing

We focus on mining relevant information from web pages. Unlike plain text documents, web pages contain an additional source of potentially relevant information: easily processable markup. We propose an approach to keyword extraction that enhances Automatic Term Recognition (ATR) algorithms, which are designed for plain text documents, with an analysis of the HTML tags present in the document. We distinguish the tags that have semantic potential. We present the results of an experiment conducted on a set of Wikipedia pages, which shows that the enhanced approach yields better results than ATR algorithms alone.
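The general idea of combining a plain-text term score with tag information can be illustrated with a minimal sketch. It is not the method evaluated in the paper: raw word frequency stands in for a real ATR algorithm, and the `TAG_WEIGHTS` table, the `TagAwareCounter` class, and the `score_terms` function are hypothetical names and values chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical weights for tags assumed to carry semantic potential.
# These values are illustrative assumptions, not taken from the paper.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.2}


class TagAwareCounter(HTMLParser):
    """Counts words twice: unweighted, and weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open tags
        self.plain = Counter()    # plain frequency (stand-in for an ATR score)
        self.boosted = Counter()  # frequency scaled by tag weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed ones.
        if tag in self.open_tags:
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the strongest weight among the enclosing tags; default is 1.0.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.plain[word] += 1
            self.boosted[word] += weight


def score_terms(html):
    """Return (plain, boosted) word scores for an HTML string."""
    parser = TagAwareCounter()
    parser.feed(html)
    parser.close()
    return parser.plain, parser.boosted


page = (
    "<html><head><title>Term recognition</title></head>"
    "<body><p>recognition of a term in plain text text text</p></body></html>"
)
plain, boosted = score_terms(page)
print(plain.most_common(3))
print(boosted.most_common(3))
```

In this toy page, "text" is the most frequent word in the plain count, while the tag-weighted count ranks "term" and "recognition" higher because both also appear in the title. A real system would replace the frequency count with an actual ATR score and choose the weights empirically.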
