Building Quality-Based Views of the Web

The rapid growth of the information available on the Web makes retrieving relevant content increasingly hard. The difficulty lies both in the semantics of the content and in filtering for quality sources. A recent strategy for coping with the overwhelming amount of information is to focus the search on a snapshot of the Internet, namely a Web view. In this paper, we present a system that supports the creation of a quality-based view of the Web. We give a brief overview of the software and its functional architecture, and place more emphasis on the role of AI in supporting the organization of Web resources into a hierarchical structure of categories. We survey our recent work on document classifiers that address a twofold challenge. On the one hand, the task is to recommend classifications of Web resources when the taxonomy provides no classified examples, which usually happens when taxonomies are built from scratch. On the other hand, even when taxonomies are populated, classifiers are trained with few examples, because once a category accumulates a certain number of Web resources the organization policy usually suggests refining the taxonomy. The paper also briefly describes a couple of case studies in which the system has been deployed in real-world applications.
