Paper Information - Research on Content Extraction of Rich Text Web Pages

Research on Content Extraction of Rich Text Web Pages

Obtaining useful information from Web pages has become a hot topic in Internet data processing. Against this background, this paper studies content extraction from rich text Web pages. First, it designs a method for extracting the title and release time of a rich text page. Statistics over a large number of data sets show that almost all body titles appear in the title tag, so a tag-based regular-expression matching algorithm is designed; in tests of an implementation, the average extraction accuracy exceeds 80%. For the release time, a common regular expression is matched against the full text of the web page, and the matches are filtered by the position of each timestamp in the page, with the rule that the first timestamp to occur is taken as the release time. In tests on the local data set, the average extraction accuracy of this algorithm exceeds 75%. Second, the paper designs a method for extracting the body text of rich text web pages. On the training data set, a logistic regression model, a random forest model, and a support vector machine model are each trained with ten-fold cross-validation, and the three models are then fused to identify the class tag of the body. On the test data set, the experimental results show an average accuracy of 95.6%, with recall and F1 values reported as 0.948 and 0.923; once the class label of the body text is determined, the text can be extracted completely (100%). Finally, the paper evaluates the proposed algorithm and proposes corresponding improvements.
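The title and release-time rules described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual regular expressions, so the patterns `TITLE_RE` and `TIMESTAMP_RE`, the function names, and the sample page below are all assumptions made for illustration only.

```python
import re
from typing import Optional

# Hypothetical patterns: the paper's own regular expressions are not published
# in the abstract, so these are illustrative stand-ins.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
TIMESTAMP_RE = re.compile(
    r"\d{4}[-/.年]\d{1,2}[-/.月]\d{1,2}日?"   # date, e.g. 2018-05-01 or 2018年5月1日
    r"(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?"       # optional time, e.g. 08:30 or 08:30:15
)


def extract_title(html: str) -> Optional[str]:
    """Return the text inside the first <title> tag, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse internal whitespace and trim the ends.
    return " ".join(match.group(1).split())


def extract_release_time(html: str) -> Optional[str]:
    """Return the first timestamp found in the page source, or None."""
    # re.search scans left to right, so the earliest match in the page wins,
    # mirroring the "first occurrence is the release time" rule.
    match = TIMESTAMP_RE.search(html)
    return match.group(0) if match else None


# Made-up sample page for demonstration.
page = """<html><head><title> Sample News Headline </title></head>
<body><span class="date">2018-05-01 08:30</span>
<p>Body text. Updated 2018-05-03.</p></body></html>"""

print(extract_title(page))
print(extract_release_time(page))
```

Taking the first match keeps the rule simple, but it would pick the wrong value on pages where a navigation bar or banner shows a date before the article's own timestamp, which may partly explain why the reported accuracy for release time is lower than for titles.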
