Web classification of conceptual entities using co-training

Social networking websites, which profile objects with predefined attributes and their relationships, often rely heavily on their users to contribute the required information. We have observed, however, that many web pages are created collectively according to the composition of some physical or abstract entity, e.g., a company, a person, or an event. Furthermore, users often like to organize pages into conceptual categories for better search and retrieval, making it feasible to extract relevant attributes and relationships from the web. Given a set of entities, each consisting of a set of web pages, we call the task of assigning pages to their corresponding conceptual categories conceptual web classification. To address this task, we propose an entity-based co-training (EcT) algorithm that learns from unlabeled examples to boost its performance. Unlike existing co-training algorithms, EcT takes into account the entity semantics hidden in web pages and requires no prior knowledge of the underlying class distribution, which is crucial in the standard co-training algorithms used in web classification. In our experiments, we evaluated EcT, standard co-training, and three other non-co-training learning methods on the Conf-425 dataset. Both EcT and co-training performed well compared to the baseline methods that required a large number of training examples.
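For readers unfamiliar with the standard co-training baseline mentioned above, the following is a minimal sketch of a generic two-view co-training loop. It is not the EcT algorithm, whose details are not given here. The nearest-centroid classifier, the function names, and the toy data are all assumptions chosen for illustration. The parameters `p` and `n` (positives and negatives added per round) are where standard co-training typically encodes the class ratio, which is the prior knowledge that EcT is said to avoid.

```python
def fit_centroids(rows, labels):
    """Nearest-centroid model (an illustrative stand-in): one mean vector per class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        model[cls] = [sum(col) / len(members) for col in zip(*members)]
    return model


def score(model, row):
    """Signed confidence: > 0 favours class 1, < 0 favours class 0."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid)) ** 0.5
    return dist(model[0]) - dist(model[1])


def co_train(view_a, view_b, labels, rounds=5, p=1, n=1):
    """Generic two-view co-training (hypothetical helper, not EcT).

    labels[i] is 0, 1, or None (unlabeled); the labeled seed set must
    contain both classes. In each round, each view trains on the current
    labels and pseudo-labels its p most confident positives and n most
    confident negatives, which the other view then trains on.
    """
    labels = list(labels)  # work on a copy
    for _ in range(rounds):
        for view in (view_a, view_b):
            known = [i for i, y in enumerate(labels) if y is not None]
            unknown = [i for i, y in enumerate(labels) if y is None]
            if not unknown:
                return labels
            model = fit_centroids([view[i] for i in known],
                                  [labels[i] for i in known])
            scores = {i: score(model, view[i]) for i in unknown}
            ranked = sorted(unknown, key=scores.get)
            # Most confident negatives sit at the front, positives at the back.
            for i in [j for j in ranked if scores[j] < 0][:n]:
                labels[i] = 0
            for i in [j for j in reversed(ranked) if scores[j] > 0][:p]:
                labels[i] = 1
    return labels


# Toy data: two feature views of six pages, with one labeled seed per class.
view_a = [[0.0, 0.1], [0.2, 0.0], [0.3, 0.2], [1.0, 0.9], [0.8, 1.0], [0.7, 0.8]]
view_b = [[5.0], [4.5], [4.0], [1.0], [1.5], [2.0]]
seed = [0, None, None, 1, None, None]
print(co_train(view_a, view_b, seed))  # [0, 0, 0, 1, 1, 1]
```

With `p = n = 1` the sketch implicitly assumes balanced classes. On skewed data these values would have to be tuned to the true class ratio, which illustrates why a method that needs no such prior is attractive.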
