Confidence-Based Incremental Classification for Objects with Limited Attributes in Vertical Search

Vertical search engines let users search web pages within a specific domain, such as products, restaurants, or academic papers, and present only the information of interest. Gathering such objects from multiple web pages and integrating them into a single system is a useful facility for users. Placing objects extracted from multiple data sources into a single hierarchical structure is a challenging classification problem, especially when the objects have few attributes. In this work, we propose a confidence-based incremental Naive Bayesian approach to categorization, focusing on the product domain. The incremental approach extends the training set and retrains the classifier as new objects are assigned to a category with high confidence. The order in which product data is processed is also taken into account. The proposed approach is applied to a vertical search engine that collects product data from several online stores.
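
The incremental step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the multinomial model over title tokens, the Laplace smoothing, the `threshold` value, and the names `NaiveBayes` and `classify_incrementally` are all assumptions made for illustration. Because a Naive Bayes model is fully determined by its counts, updating the counts with a confidently labeled object is treated here as equivalent to retraining on the extended training set.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over tokens, stored as raw counts (hypothetical helper)."""

    def __init__(self):
        self.class_counts = Counter()            # objects seen per category
        self.word_counts = defaultdict(Counter)  # token counts per category
        self.vocab = set()

    def add(self, tokens, label):
        # Adding one labeled object updates the counts, which is the same
        # as retraining on the training set extended by that object.
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def posterior(self, tokens):
        # Log-space scores with Laplace (add-one) smoothing.
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            log_scores[label] = score
        # Normalize to probabilities; subtract the max for numerical stability.
        m = max(log_scores.values())
        exp = {c: math.exp(s - m) for c, s in log_scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def classify_incrementally(model, titles, threshold=0.9):
    """Assign each title in order; learn only from high-confidence assignments."""
    assigned, deferred = {}, []
    for title in titles:
        tokens = title.lower().split()
        post = model.posterior(tokens)
        label, confidence = max(post.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            model.add(tokens, label)  # extend the training set
            assigned[title] = label
        else:
            deferred.append(title)    # left for later or manual review
    return assigned, deferred
```

Since the model changes after every accepted object, the same set of titles can yield different assignments when processed in a different order, which is one reason the ordering of product data matters in an approach of this kind.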
