A Generalized Hidden Markov Model Approach for Web Information Extraction

This paper presents a generalized hidden Markov model (GHMM) that extends the traditional HMM with Web-specific information for Web information extraction. Our approach uses Web content blocks, rather than content terms, as the basic extraction unit. Instead of following the traditional sequential state transition order, a GHMM derives its state transition order from the layout structure of the corresponding Web page. It also applies multiple emission features rather than a single emission feature. In this way, GHMMs are better suited to Web information extraction. Experiments show promising results for GHMMs.
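As a rough illustration of the ideas above, the following minimal sketch runs Viterbi decoding over content blocks rather than terms, with several emission features per block. It is not the paper's implementation: the function name `viterbi_blocks`, the feature names (`font`, `position`), the state labels, and all probabilities are hypothetical, the features are assumed independent given the state, and the blocks are assumed to arrive already arranged in a layout-derived order.

```python
import math

def _log(p):
    # Floor probabilities so that zero entries do not break math.log.
    return math.log(max(p, 1e-12))

def viterbi_blocks(blocks, states, start_p, trans_p, emit_p):
    """Return the most likely state for each block, in the given block order.

    blocks: list of dicts, one per content block (feature name -> value)
    emit_p: emit_p[state][feature][value] -> probability
    """
    def log_emit(state, block):
        # Assumption: features are independent given the state,
        # so the per-feature log-probabilities simply add up.
        return sum(_log(emit_p[state][f].get(v, 0.0)) for f, v in block.items())

    score = {s: _log(start_p[s]) + log_emit(s, blocks[0]) for s in states}
    back = []
    for block in blocks[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + _log(trans_p[r][s]))
            score[s] = prev[best] + _log(trans_p[best][s]) + log_emit(s, block)
            pointers[s] = best
        back.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical toy data: two blocks, already in layout-derived order.
states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {"title": {"title": 0.1, "author": 0.9},
           "author": {"title": 0.2, "author": 0.8}}
emit_p = {
    "title": {"font": {"large": 0.9, "small": 0.1},
              "position": {"top": 0.8, "middle": 0.2}},
    "author": {"font": {"large": 0.2, "small": 0.8},
               "position": {"top": 0.3, "middle": 0.7}},
}
blocks = [{"font": "large", "position": "top"},
          {"font": "small", "position": "middle"}]
labels = viterbi_blocks(blocks, states, start_p, trans_p, emit_p)
```

With this toy data, `labels` is `["title", "author"]`. How the block order is obtained from the page layout, and how the features are actually combined, are left open here because the abstract does not specify them.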
