Domain Topic and Hidden Deep Web Data Extracting

This paper studies methods for extracting web data entities based on domain knowledge. Through an analysis of real estate industry websites, a topic-oriented extracting model is proposed, and the corresponding search strategy is given. In addition, for hidden deep web information, a sorting-based classification extraction algorithm is designed for numerical data. Finally, an experimental example is given to verify the effectiveness of the algorithm.

Introduction

With the development of human society, people's demand for information interaction has kept increasing, and the Internet has emerged in response [1]. With the help of the Internet, information can be disseminated quickly, and the types of information keep growing: documents, pictures, videos, audio, hyperlinks, forms, and so on [1,2,3]. This growing demand for information, together with in-depth research in the area, has promoted the development of Web information extraction technology. At present, various web information extraction tools and methods have emerged [4,5,6]. Although most of these tools and systems ultimately use web page wrappers to acquire structured data from a data source, their methodologies and the research areas they draw on differ. Web information extraction systems and related techniques are roughly classified according to the principle by which they identify and locate the user's data in web pages. The main types are ontology-based extraction, location-based extraction, NLP-based extraction, wrapper-modeling-based extraction, web-query-based information extraction, and so on [7,8,9,10].
However, most methods for extracting web data do not consider domain requirements. Moreover, a lot of useful domain data is stored in back-end databases; this hidden deep web data can only be obtained through continuous querying and extraction. Motivated by this, we study methods for extracting web data entities based on domain knowledge and propose a topic-oriented extracting model. For the deep information behind domain web pages, a sorting-based classification extraction algorithm is designed for numerical data.

Domain Topic Extracting Model and Searching Strategy

Domain Topic Extracting Model. The domain topic extracting model in this paper improves on the generic crawler model; its flowchart is shown in Fig. 1. Compared with the general crawler model, the domain topic model has two additional modules: the page topic relevance calculation module and the candidate URL priority calculation module. The page topic relevance calculation module filters the saved pages according to their relevance to the topic. If the relevance of a page to the topic is higher than the set threshold, the candidate URLs of the page are extracted and passed to the candidate URL priority calculation module. The calculation rule is as follows: if a candidate URL is relatively strongly related to the topic, it is inserted at the front of the queue; otherwise, it is inserted at the back of the queue or discarded.

3rd International Conference on Automation, Mechanical Control and Computational Engineering (AMCCE 2018). Copyright © 2018, the Authors. Published by Atlantis Press. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/). Advances in Engineering Research, volume 166
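
The filtering and queueing rule of the domain topic extracting model can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made only for illustration: the relevance measure (fraction of topic terms present), the use of anchor text to score candidate URLs, the threshold values, the two-level split between "back of the queue" and "discarded", and the small in-memory `pages` dictionary that stands in for real page downloads.

```python
from collections import deque

# Hypothetical topic vocabulary for a real estate domain (assumption).
TOPIC_TERMS = {"house", "apartment", "price", "rent"}


def relevance(text, topic_terms):
    """Stand-in relevance measure: fraction of topic terms found in the text."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def crawl(pages, seed, topic_terms, page_threshold=0.4, url_high=0.5, url_low=0.0):
    """Topic-oriented crawl over an in-memory site.

    pages maps a URL to (page_text, [(anchor_text, target_url), ...]).
    All threshold values are illustrative assumptions.
    """
    frontier = deque([seed])
    seen = {seed}
    saved = []
    while frontier:
        url = frontier.popleft()
        text, links = pages.get(url, ("", []))
        # Page topic relevance module: keep only pages above the threshold.
        if relevance(text, topic_terms) <= page_threshold:
            continue
        saved.append(url)
        # Candidate URL priority module: score each link by its anchor text.
        for anchor, target in links:
            if target in seen:
                continue
            score = relevance(anchor, topic_terms)
            if score >= url_high:
                frontier.appendleft(target)  # strongly related: front of queue
                seen.add(target)
            elif score > url_low:
                frontier.append(target)      # weakly related: back of queue
                seen.add(target)
            # otherwise the candidate URL is discarded
    return saved


# Hypothetical four-page site used only to exercise the sketch.
pages = {
    "seed": ("house price rent apartment listings",
             [("house news", "news"),
              ("about us", "about"),
              ("apartment rent price", "listings")]),
    "listings": ("apartment rent price table", []),
    "news": ("house price trends", []),
    "about": ("company history", []),
}

order = crawl(pages, "seed", TOPIC_TERMS)
print(order)
```

In this toy site the "listings" link is found last but visited before "news", because its anchor text scores above `url_high` and it is pushed to the front of the queue; the "about" link scores zero and is never queued. A real crawler would replace the `pages` lookup with page fetching and would likely use a stronger relevance model than term overlap.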