Authoring of Personalized Web Page from Heterogeneous Web Pages by Content Extraction and Integration

Authoring a personalized Web page by integrating heterogeneous Web page elements from different sites is a challenging task in Web 2.0 applications. This paper proposes an approach that extracts various partitions or elements, such as basic HTML elements, CSS definitions, and JavaScript source code, from different Web sites and uses them to author new pages from heterogeneous Web pages. A novel DOM tree model, the CS-DOM tree, is introduced to retrieve the CSS definitions. To keep the new Web pages synchronized with their source pages, a method is then presented that uses the structure of the DOM and the context of elements to relocate elements that were retrieved earlier. Finally, a similarity calculation algorithm is developed to judge whether the relocated elements and the previously retrieved elements come from the same position. The proposed method has been applied to develop a personalized portal.

Introduction

In Web 2.0 applications, users are not only information consumers but also information producers. Allowing readers to author Web pages and create personalized portals is therefore necessary for Web 2.0, and this requires retrieving information from heterogeneous Web pages. Information retrieval from Web pages normally arises in two situations. The first is search engines, which classify Web pages into multiple topics; when a user searches for a topic, the search engine finds the relevant pages and shows them to the user. The second is noise cleaning, which divides a Web page into several blocks, identifies whether each block is noise (such as navigation panels, copyright notices, and advertisements), and then eliminates the noise blocks from the page. In both situations, the target Web pages are handled as a whole.
However, as Web 2.0 applications are required to provide more personalized services, integrating selected parts of different Web pages into new pages is becoming increasingly important. Because a part of a Web page corresponds to certain elements in the HTML, this situation requires working at the element level of Web pages. These fine-grained elements must be woven into personalized Web content to meet the needs of individual users. Element-level authoring of Web pages mainly involves extracting the HTML tags and their attributes, the CSS information, and the JavaScript source code of the elements. In addition, the system should update the new pages so that they stay synchronized with the pages from which their elements come. This paper focuses on solving these problems.

Advances in Computer Science Research (ACRS), volume 54. International Conference on Computer Networks and Communication Technology (CNCT2016)
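
As a rough illustration of what element-level extraction involves, the following Python sketch pulls a single element out of a page and records its tag, its attributes, its inline CSS declarations, and its inline JavaScript event handlers. It is not the CS-DOM tree method of this paper: it relies only on the standard-library `html.parser`, reads CSS from the `style` attribute alone rather than from style sheets, selects the element by `id`, and assumes reasonably well-formed HTML. The names `ElementExtractor`, `parse_inline_style`, and `extract_element` are hypothetical and introduced purely for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_inline_style(style):
    """Split a style attribute into a {property: value} dict.

    Simplification: values that themselves contain ';' are not handled.
    """
    props = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip():
            props[name.strip().lower()] = value.strip()
    return props


class ElementExtractor(HTMLParser):
    """Collect the details of the first element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target (0 = outside)
        self.result = None    # filled in once the target element is found
        self.text_parts = []  # text nodes found inside the target

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.result is None and attr_map.get("id") == self.target_id:
            self.result = {
                "tag": tag,
                "attributes": attr_map,
                "inline_css": parse_inline_style(attr_map.get("style") or ""),
                # Inline handlers such as onclick carry JavaScript source.
                "event_handlers": {k: v for k, v in attr_map.items()
                                   if k.startswith("on")},
            }
            if tag not in VOID_TAGS:
                self.depth = 1
        elif self.depth > 0 and tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.text_parts.append(data)


def extract_element(html, target_id):
    """Return a dict describing the element with the given id, or None."""
    parser = ElementExtractor(target_id)
    parser.feed(html)
    parser.close()
    if parser.result is None:
        return None
    # Join text nodes and collapse runs of whitespace.
    parser.result["text"] = " ".join(" ".join(parser.text_parts).split())
    return parser.result
```

Calling `extract_element(page_source, "news")` on a page that contains an element with `id="news"` returns a dictionary with the keys `tag`, `attributes`, `inline_css`, `event_handlers`, and `text`. A system of the kind described here would additionally need the CSS rules that apply to the element from external and embedded style sheets, as well as a way to relocate the element after the source page changes, neither of which this sketch attempts.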
