Web Page Information Extraction Service Based on Graph Convolutional Neural Network and Multimodal Data Fusion

Information extraction and related services are an active research topic. Most existing work focuses on extracting information from a given web page and ignores locating the pages that contain useful information. However, a holistic extraction system must both locate a relevant web page and extract information from it; these two steps are equally indispensable. For instance, extracting lecture news from university websites is a typically hard task that requires locating the relevant pages before extracting the news content. Because layouts and visual appearances vary widely across sites, statistics-based and vision-based methods fail to find such pages. In this study, we propose a holistic method to locate lecture news pages on university websites. A Graph Convolutional Network (GCN) is applied to fuse multimodal data, learning useful features from three views: the link relationships, the visual similarity, and the semantics of web pages. First, we apply a link model to explore the parent-child relationships between web pages; we then compute the similarity of parent-child pages with a visual model and obtain semantic features from a BERT model. Specifically, the visual similarity features are learned with a triplet loss, which forces the Convolutional Neural Network (CNN) to embed pages of the same group close together. Finally, these features are fused in the GCN model to identify the target web page, and the resulting model adapts to a variety of university websites. Experiments conducted on 50 websites show that our method outperforms state-of-the-art approaches.
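As a rough illustration of the pipeline described above, the following minimal PyTorch sketch combines a CNN visual embedding trained with a triplet loss, precomputed semantic vectors (e.g., BERT [CLS] embeddings), and a simple GCN over the page link graph. All class names, dimensions, and the toy data are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualEncoder(nn.Module):
    """Small CNN that embeds a page screenshot; trained with a triplet loss
    so pages from the same group map to nearby embeddings (hypothetical)."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, embed_dim)

    def forward(self, x):
        return F.normalize(self.fc(self.features(x).flatten(1)), dim=-1)

class SimpleGCNLayer(nn.Module):
    """One GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return F.relu(self.lin(norm_adj @ h))

class PageClassifier(nn.Module):
    """Concatenates visual and semantic features per node, then scores pages."""
    def __init__(self, vis_dim=128, sem_dim=768, hid_dim=256):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(vis_dim + sem_dim, hid_dim)
        self.gcn2 = SimpleGCNLayer(hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, 2)  # lecture-news page vs. other

    def forward(self, vis_feat, sem_feat, adj):
        h = torch.cat([vis_feat, sem_feat], dim=-1)
        h = self.gcn2(self.gcn1(h, adj), adj)
        return self.out(h)

# --- illustrative usage with random stand-ins for real data ---
n_pages = 6
screenshots = torch.randn(n_pages, 3, 224, 224)   # page screenshots (toy data)
sem_feat = torch.randn(n_pages, 768)              # e.g., BERT [CLS] vectors
adj = torch.zeros(n_pages, n_pages)
adj[0, 1:] = adj[1:, 0] = 1.0                     # star-shaped parent-child link graph

vis_enc = VisualEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
# anchor/positive come from the same page group, negative from a different one
loss = triplet(vis_enc(screenshots[0:1]), vis_enc(screenshots[1:2]),
               vis_enc(screenshots[2:3]))

model = PageClassifier()
scores = model(vis_enc(screenshots), sem_feat, adj)  # (n_pages, 2) logits per page
```

In this sketch the link graph supplies the adjacency matrix, the triplet loss shapes the visual embedding space, and the GCN propagates the fused visual-plus-semantic node features before a per-page classification head.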