Web Entities Extraction Based on Semi-Structured Semantic Database

The Web is the largest source of information and contains many entities and the relationships between them. Extracting these data from massive numbers of Web pages and integrating them into semi-structured data with rich semantics makes the data easier to manage and use. On this premise, a comprehensive method is proposed to extract entities and relationships from Web pages. The method consists of two steps: 1) the target Web pages that contain these entities are found based on a combination of visual information and keyword content, while the relationships between parent and child target Web pages are recorded; 2) the entities are extracted by analyzing the DOM tree structure of the obtained Web pages and applying a set of defined extraction rules. Finally, the extracted data are organized into semi-structured data with special relationships. Experiments on a large number of HTML pages show that this method achieves a high correct rate and high coverage.
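The second step and the final organization of the data can be illustrated with a minimal sketch. It is not the authors' implementation: the rule table `RULES`, the record marker `RECORD`, the class names, the example URLs, and the function names are all hypothetical, and the first step (locating target pages from visual information and keywords) is assumed to have already produced the pages and their parent links. The sketch tracks the path of open DOM elements while parsing and applies tag/class rules to pull out entity fields, then links each page to its parent.

```python
from html.parser import HTMLParser

# Hypothetical extraction rules: (tag, class attribute) -> field name.
RULES = {("h2", "name"): "name", ("span", "price"): "price"}
# Hypothetical marker for the element that wraps one entity record.
RECORD = ("div", "item")


class EntityExtractor(HTMLParser):
    """Applies RULES to text nodes while tracking the open-element path."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as (tag, class) pairs
        self.entities = []   # one dict per extracted entity

    def handle_starttag(self, tag, attrs):
        key = (tag, dict(attrs).get("class"))
        self.stack.append(key)
        if key == RECORD:
            self.entities.append({})  # start a new entity record

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack or not self.entities:
            return
        field = RULES.get(self.stack[-1])
        if field and RECORD in self.stack:
            self.entities[-1][field] = text


def extract_entities(html):
    parser = EntityExtractor()
    parser.feed(html)
    parser.close()
    return parser.entities


def build_semi_structured(pages):
    """pages maps url -> (parent_url or None, html).

    Returns url -> {"parent", "children", "entities"} so that the
    parent/child page relationships are kept next to the entities.
    """
    result = {
        url: {"parent": parent, "children": [], "entities": extract_entities(html)}
        for url, (parent, html) in pages.items()
    }
    for url, node in result.items():
        if node["parent"] in result:
            result[node["parent"]]["children"].append(url)
    return result


pages = {
    "http://example.com/list": (
        None,
        "<div class='item'><h2 class='name'>Widget</h2>"
        "<span class='price'>9.99</span></div>",
    ),
    "http://example.com/detail": (
        "http://example.com/list",
        "<div class='item'><h2 class='name'>Gadget</h2></div>",
    ),
}
data = build_semi_structured(pages)
```

A real system would derive or learn its rules from the DOM structure of the target site rather than hard-code them; the fixed table here only shows how rules, tree position, and page relationships fit together.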
