Information Extraction from Web Pages Based on Improved DSE Algorithm

With the rapid development of Internet technology, more and more people have come to recognize the Internet as a huge source of information. The central problem in web information extraction is how to extract and organize information from the Internet automatically and effectively. Building on the DSE algorithm and the RoadRunner system, we propose a new automated information extraction method that generates the template from template pages and their URLs. To determine the threshold, we introduce the FDR approach from bioinformatics, which gives the choice of threshold a theoretical basis. Experimental results show that the improved method significantly increases extraction accuracy.
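
The abstract does not state which FDR procedure is used or which quantities are tested. As a rough illustration only, the sketch below assumes that FDR means false discovery rate and that the Benjamini-Hochberg step-up procedure is applied to a list of p-values. The function name `bh_threshold`, the sample p-values, and the level `alpha` are hypothetical and are not taken from the paper.

```python
from typing import List


def bh_threshold(p_values: List[float], alpha: float = 0.05) -> float:
    """Return the largest p-value accepted by the Benjamini-Hochberg
    step-up procedure at level alpha, or 0.0 if none is accepted.

    Assumption: this stands in for the paper's FDR-based threshold;
    the paper's actual procedure may differ.
    """
    m = len(p_values)
    cutoff = 0.0
    # Compare the k-th smallest p-value against alpha * k / m and
    # keep the largest p-value that passes.
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= alpha * rank / m:
            cutoff = p
    return cutoff


# Hypothetical p-values, e.g. one per candidate match between pages.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
threshold = bh_threshold(p_values, alpha=0.05)
accepted = [p for p in p_values if p <= threshold]
print(threshold, accepted)
```

The point of such a procedure is that the cutoff is derived from the data and a stated error rate rather than being hand-tuned, which is one way to read the "theoretical basis" the abstract mentions.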