An empirical study on harmonizing classification precision using IE patterns

Web pages are conventionally represented for classification by the words found in their content. However, word-based representation suffers from several limitations, such as synonymy and homonymy. Motivated by these limitations, we explore the potential of representing web pages using information extraction (IE) patterns in addition to the words identified within the web content. In this paper, we report the results and findings of our experiments. Our empirical study on the WebKB dataset indicates that adding information extraction patterns to the web page representation helps to improve classification precision, especially in categories with highly diverse web content.
