EXTRACTION OF ATTRIBUTES AND VALUES FROM ONLINE TEXTS

Web documents contain vast amounts of information that can be extracted and processed to improve the understanding of online data. Often, the structure of a document can be exploited to identify useful information within it. Attribute-value pairs, which appear frequently on online retail websites, are one such kind of information. These concentrated pieces of information are often enclosed in specific tags of the web document or highlighted with markers that can be discovered and identified automatically. Once such markers are identified, different methods can be employed to extract new pairs from other, more or less similar, documents. The method presented in this paper relies on the DOM (Document Object Model) structure and the text of web pages to extract patterns consisting of tags and pieces of text, and then classifies these patterns. Several classifiers were compared, and the best results were obtained with a C4.5 decision tree classifier.
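
To make the idea of patterns built from tags and text more concrete, the following is a minimal sketch and not the paper's implementation. It assumes a pattern is simply the path of open tags above a text node together with that text, and it uses only Python's standard `html.parser` module. The names `TagPathCollector`, `extract_patterns`, and `pair_table_cells` are hypothetical, and the `th`/`td` pairing rule is a toy stand-in for the classification step; no C4.5 classifier is trained here.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Record each non-empty text node with the path of open tags above it."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.patterns = []   # (tag path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.patterns.append(("/".join(self.stack), text))


def extract_patterns(html):
    """Return (tag path, text) patterns for every text node in the markup."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.patterns


def pair_table_cells(patterns):
    """Toy stand-in for a classifier: pair a th text with the td text after it."""
    pairs = []
    for (path_a, text_a), (path_b, text_b) in zip(patterns, patterns[1:]):
        if path_a.endswith("tr/th") and path_b.endswith("tr/td"):
            pairs.append((text_a.rstrip(":"), text_b))
    return pairs


# Hypothetical product-specification snippet, invented for illustration.
SAMPLE = """
<table>
  <tr><th>Color:</th><td>Black</td></tr>
  <tr><th>Weight:</th><td>1.2 kg</td></tr>
</table>
"""

patterns = extract_patterns(SAMPLE)
pairs = pair_table_cells(patterns)
print(pairs)
```

On the sample markup, `pairs` holds `("Color", "Black")` and `("Weight", "1.2 kg")`. In a setting like the one the abstract describes, the `(tag path, text)` patterns would instead be turned into features and passed to a trained classifier rather than matched by a fixed rule.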
