An unsupervised framework for extracting and normalizing product attributes from multiple web sites

We have developed an unsupervised framework for simultaneously extracting and normalizing attributes of products from multiple Web pages originated from different sites. Our framework is designed based on a probabilistic graphical model that can model the page-independent content information and the page-dependent layout information of the text fragments in Web pages. One characteristic of our framework is that previously unseen attributes can be discovered from the clue contained in the layout format of the text fragments. Our framework tackles both extraction and normalization tasks by jointly considering the relationship between the content and layout information. Dirichlet process prior is employed leading to another advantage that the number of discovered product attributes is unlimited. An unsupervised inference algorithm based on variational method is presented. The semantics of the normalized attributes can be visualized by examining the term weights in the model. Our framework can be applied to a wide range of Web mining applications such as product matching and retrieval. We have conducted extensive experiments from four different domains consisting of over 300 Web pages from over 150 different Web sites, demonstrating the robustness and effectiveness of our framework.

[1]  Pedro M. Domingos,et al.  Entity Resolution with Markov Logic , 2006, Sixth International Conference on Data Mining (ICDM'06).

[2]  Bo Zhang,et al.  Webpage understanding: an integrated approach , 2007, KDD '07.

[3]  Rayid Ghani,et al.  Semi-Supervised Learning of Attribute-Value Pairs from Product Descriptions , 2007, IJCAI.

[4]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[5]  Clement T. Yu,et al.  Mining templates from search result records of search engines , 2007, KDD '07.

[6]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[7]  Alicia Ageno,et al.  Adaptive information extraction , 2006, CSUR.

[8]  Craig A. Knoblock,et al.  Hierarchical Wrapper Induction for Semistructured Information Sources , 2004, Autonomous Agents and Multi-Agent Systems.

[9]  Wai Lam,et al.  Adapting Web information extraction knowledge via mining site-invariant and site-dependent features , 2007, TOIT.

[10]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[11]  Lancelot F. James,et al.  Gibbs Sampling Methods for Stick-Breaking Priors , 2001 .

[12]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[13]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[14]  Andrew McCallum,et al.  An Integrated, Conditional Model of Information Extraction and Coreference with Appli , 2004, UAI.

[15]  Andrew McCallum,et al.  Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data , 2004, J. Mach. Learn. Res..

[16]  Shui-Lung Chuang,et al.  Context-Aware Wrapping: Synchronized Data Extraction , 2007, VLDB.

[17]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[18]  Andrew McCallum,et al.  A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models , 2003 .

[19]  Michael I. Jordan,et al.  Variational inference for Dirichlet process mixtures , 2006 .