An unsupervised method for joint information extraction and feature mining across different Web sites

We develop an unsupervised learning framework that jointly extracts information and mines features from a set of Web pages across different sites. A key characteristic of our model is that it allows tight interaction between the two tasks: decisions for both are made coherently, yielding solutions that satisfy both tasks while eliminating potential conflicts. Our approach is based on an undirected graphical model that captures the interdependence between text fragments within the same Web page as well as between text fragments in different Web pages. Web pages from different sites are considered simultaneously, so information from different sources can be leveraged effectively. An approximate learning algorithm is developed to conduct inference over the graphical model and thereby tackle the information extraction and feature mining tasks. We apply the framework to two applications: important product feature mining from vendor sites and hot item feature mining from auction sites. Extensive experiments on real-world data demonstrate its effectiveness.
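To make the idea of inference over an undirected graphical model of text fragments concrete, the following is a minimal sketch and not the algorithm of this paper, whose details the abstract does not give. It assumes a pairwise Markov random field and uses sum-product (loopy) belief propagation as a stand-in for the approximate inference step. The fragment names, the two labels (0 = other, 1 = product feature), and all potential values are hypothetical.

```python
from math import prod


def loopy_bp(unary, edges, pairwise, iters=50):
    """Sum-product belief propagation on a pairwise Markov random field.

    unary:    {node: [potential for label 0, potential for label 1, ...]}
    edges:    list of undirected (u, v) pairs
    pairwise: function (label_u, label_v) -> non-negative compatibility
    Returns {node: normalized belief over labels}.
    """
    n_labels = len(next(iter(unary.values())))
    neighbors = {n: [] for n in unary}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    # msgs[(u, v)][x] is the message from u to v about v taking label x.
    msgs = {(u, v): [1.0 / n_labels] * n_labels
            for u in neighbors for v in neighbors[u]}

    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = []
            for xv in range(n_labels):
                total = 0.0
                for xu in range(n_labels):
                    # Product of messages into u from all neighbors except v.
                    incoming = prod(msgs[(w, u)][xu]
                                    for w in neighbors[u] if w != v)
                    total += unary[u][xu] * pairwise(xu, xv) * incoming
                out.append(total)
            z = sum(out)
            new[(u, v)] = [m / z for m in out]
        msgs = new  # synchronous update of all messages

    beliefs = {}
    for n in unary:
        b = [unary[n][x] * prod(msgs[(w, n)][x] for w in neighbors[n])
             for x in range(n_labels)]
        z = sum(b)
        beliefs[n] = [p / z for p in b]
    return beliefs


# Hypothetical example: fragments "a" and "b" sit on the same page,
# fragment "c" sits on a page from a different site.
unary = {"a": [0.2, 0.8], "b": [0.5, 0.5], "c": [0.5, 0.5]}
edges = [("a", "b"),   # same-page dependence
         ("b", "c")]   # cross-page dependence


def agree(x, y):
    # Assumed compatibility: linked fragments prefer the same label.
    return 2.0 if x == y else 1.0


beliefs = loopy_bp(unary, edges, agree)
```

In this toy graph only fragment "a" carries direct evidence, yet the beliefs for "b" and "c" shift toward the same label through the same-page and cross-page edges. The graph here is a chain, so the result is exact; on graphs with cycles the same procedure is only approximate and is not guaranteed to converge.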
