Automatic Extraction Rules Generation Based on XPath Pattern Learning

Web forums have become important information sources on the Web because of the rich content that millions of Internet users contribute every day. Extracting data from Web pages is a key step in data analysis, but it is cumbersome because it usually requires significant human intervention. Web forums have fairly regular page structures, which allows extraction rules to be generated automatically from the paths to their data elements. In this paper, we introduce formal expressions for XPath patterns and pattern mapping rules, and propose machine learning methods that generate extraction rules for automatic data extraction from Web forums. Experimental results on real-life Web forums show that the approach is feasible and accurate for forum data.
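
To make the idea of deriving a rule from element paths concrete, the following is a minimal sketch and not the learning method of the paper. It assumes that a few example XPaths for the same kind of forum field (for instance, the author cell of each post) are already known, and it generalizes them into one pattern by keeping the positional indexes on which all examples agree and dropping the ones that vary. The function names `parse` and `generalize` and the example paths are hypothetical, chosen only for illustration.

```python
import re

# One XPath step: a tag name with an optional positional index, e.g. "tr[3]".
STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def parse(xpath):
    """Split an absolute XPath into (tag, index) steps; index is None if absent."""
    steps = []
    for part in xpath.strip("/").split("/"):
        m = STEP.match(part)
        if m is None:
            raise ValueError(f"unsupported step: {part!r}")
        tag, idx = m.group(1), m.group(2)
        steps.append((tag, int(idx) if idx else None))
    return steps


def generalize(xpaths):
    """Merge example XPaths that share a tag sequence into one pattern.

    An index is kept where every example agrees and dropped where the
    examples differ, so that step matches all siblings with that tag.
    """
    parsed = [parse(x) for x in xpaths]
    tags = [[t for t, _ in p] for p in parsed]
    if any(t != tags[0] for t in tags):
        raise ValueError("examples do not share a tag sequence")
    out = []
    for column in zip(*parsed):
        tag = column[0][0]
        indexes = {i for _, i in column}
        if len(indexes) == 1 and None not in indexes:
            out.append(f"{tag}[{indexes.pop()}]")
        else:
            out.append(tag)
    return "/" + "/".join(out)


# Hypothetical paths to the same field in three different posts.
examples = [
    "/html/body/div[2]/table/tr[1]/td[2]",
    "/html/body/div[2]/table/tr[2]/td[2]",
    "/html/body/div[2]/table/tr[5]/td[2]",
]
pattern = generalize(examples)
print(pattern)
```

In this sketch only the row index varies across the examples, so the resulting pattern keeps `div[2]` and `td[2]` but leaves `tr` unindexed, which lets it select the same cell in every row. A learned rule as described in the abstract would presumably handle more variation than this simple index comparison.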
