LiveClassifier: creating hierarchical text classifiers through Web corpora

Many Web information services use information extraction (IE) techniques to collect important facts from the Web. One way to build more advanced services is to discover thematic information in the collected facts through text classification. However, most conventional text classification techniques rely on manually labelled corpora and are thus ill-suited to open-domain Web information services. In this work, we present a system named LiveClassifier that can automatically train classifiers through Web corpora based on user-defined topic hierarchies. Because of its flexibility and convenience, LiveClassifier can be easily adapted for various purposes: new Web information services can be built on top of it, and human users can use it to create classifiers for their personal applications. The effectiveness of the classifiers created by LiveClassifier is supported by empirical evidence.
