Optimization of Hidden Markov Model by a Genetic Algorithm for Web Information Extraction

This paper presents a new training method that combines a genetic algorithm (GA) with the Baum-Welch algorithm to obtain a hidden Markov model (HMM) with both an optimized number of states and optimized model parameters for web information extraction. The method not only overcomes the slow convergence of conventional HMM training but also finds a better number of states for the HMM topology, together with better model parameters. In experiments on 2100 web pages extracted from our corpus, the method found the optimal topology in all cases. The GA-HMM approach achieved an average precision of 84.483%, whereas the HMM trained by the Baum-Welch method alone achieved an average precision of 71.049%. This indicates that the GA-HMM method yields a better-optimized model than Baum-Welch training alone.
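
As a rough illustration of the idea summarized above, the following minimal Python sketch evolves a small population of discrete HMMs with different numbers of states, refines each candidate with a few Baum-Welch steps, and keeps the fittest. It is not the implementation evaluated in this paper. The function names, the BIC-style fitness penalty, the smoothing constant `eps`, the row-wise crossover, and all GA settings are assumptions chosen only to keep the example short.

```python
import math
import random

def random_dist(n, rng):
    # Strictly positive random probability vector of length n.
    w = [rng.random() + 0.1 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def random_hmm(n_states, n_symbols, rng):
    return {"pi": random_dist(n_states, rng),
            "A": [random_dist(n_states, rng) for _ in range(n_states)],
            "B": [random_dist(n_symbols, rng) for _ in range(n_states)]}

def forward_backward(hmm, obs):
    # Scaled forward-backward pass; returns (alpha, beta, scale, log-likelihood).
    pi, A, B = hmm["pi"], hmm["A"], hmm["B"]
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[j] * B[j][obs[0]] for j in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)
        scale.append(c)
        alpha.append([x / c for x in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
                   / scale[t + 1] for i in range(n)]
    return alpha, beta, scale, sum(math.log(c) for c in scale)

def normalize(row):
    total = sum(row)
    return [x / total for x in row]

def baum_welch_step(hmm, obs, n_symbols, eps=1e-6):
    # One Baum-Welch re-estimation step on a single observation sequence.
    # eps is a small smoothing count (an assumption) that keeps entries positive.
    A, B = hmm["A"], hmm["B"]
    n, T = len(A), len(obs)
    alpha, beta, scale, _ = forward_backward(hmm, obs)
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    new_A = [[eps] * n for _ in range(n)]
    for t in range(T - 1):
        for i in range(n):
            for j in range(n):
                new_A[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                * beta[t + 1][j] / scale[t + 1])
    new_B = [[eps] * n_symbols for _ in range(n)]
    for t in range(T):
        for i in range(n):
            new_B[i][obs[t]] += gamma[t][i]
    return {"pi": normalize([g + eps for g in gamma[0]]),
            "A": [normalize(r) for r in new_A],
            "B": [normalize(r) for r in new_B]}

def fitness(hmm, obs, n_symbols):
    # Log-likelihood minus a BIC-style complexity penalty (an assumption).
    n = len(hmm["pi"])
    n_params = n * (n - 1) + n * (n_symbols - 1) + (n - 1)
    return forward_backward(hmm, obs)[3] - 0.5 * n_params * math.log(len(obs))

def crossover(p1, p2, rng):
    # Parents have the same number of states; each state's rows come from one parent.
    n = len(p1["pi"])
    picks = [rng.choice((p1, p2)) for _ in range(n)]
    return {"pi": list(p1["pi"]),
            "A": [list(picks[i]["A"][i]) for i in range(n)],
            "B": [list(picks[i]["B"][i]) for i in range(n)]}

def ga_hmm(obs, n_symbols, min_states=1, max_states=5, pop_size=8,
           generations=5, bw_iters=3, seed=0):
    rng = random.Random(seed)
    pop = [random_hmm(rng.randint(min_states, max_states), n_symbols, rng)
           for _ in range(pop_size)]
    for _gen in range(generations):
        # Local refinement of every candidate with a few Baum-Welch steps.
        for k in range(len(pop)):
            for _it in range(bw_iters):
                pop[k] = baum_welch_step(pop[k], obs, n_symbols)
        pop.sort(key=lambda h: fitness(h, obs, n_symbols), reverse=True)
        survivors, children = pop[:pop_size // 2], []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            if len(p1["pi"]) == len(p2["pi"]):
                children.append(crossover(p1, p2, rng))
            else:
                # Topology mutation: change the state count by one and re-initialize.
                n = len(p1["pi"]) + rng.choice((-1, 1))
                n = max(min_states, min(max_states, n))
                children.append(random_hmm(n, n_symbols, rng))
        pop = survivors + children
    return max(pop, key=lambda h: fitness(h, obs, n_symbols))

# Toy usage: symbols 0..2 stand in for token classes on a web page (illustrative only).
toy_obs = [0, 0, 1, 1, 2, 2] * 8
best = ga_hmm(toy_obs, n_symbols=3)
print("selected number of states:", len(best["pi"]))
```

In this sketch the GA searches over the number of states while Baum-Welch handles local parameter refinement, which mirrors the division of labour described above. A real system would train on many labelled pages and would likely use a fitness function tied to extraction precision.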